home *** CD-ROM | disk | FTP | other *** search
Text File | 1996-05-05 | 217.2 KB | 6,018 lines |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Using the EXODUS Storage Manager V3.1
- (Last revision: November, 1993)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- ____________________
- The Exodus software was developed primarily with funds
- provided by by the Defense Advanced Research Projects Agency
- under contracts N00014-85-K-0788, N00014-88-K-0303, and DAABO7-
- 92-C-Q508 and monitored by the US Army Research Laboratory.
- Additional support was provided by Texas Instruments, Digital
- Equipment Corporation, and Apple Computer.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 1. INTRODUCTION
-
- The EXODUS Storage Manager is a multi-user object storage system
- supporting versions, indexes, single-site transactions, distri-
- buted transactions, concurrency control, and recovery. This docu-
- ment provides information about using version 3.1 of the EXODUS
- Storage Manager. Information about installing the Storage
- Manager can be found in the EXODUS Storage Manager Installation
- Manual. Section 2 gives an overview of the system. Section 3
- discusses configuration facilities. Section 4 describes, in
- detail, the Storage Manager's application interface. Section 5
- describes how to use the Storage Manager server. Appendices pro-
- vide more details on certain aspects of the system. A table of
- contents is located at the end of the document.
-
- 2. OVERVIEW OF THE EXODUS STORAGE MANAGER
-
- This section, an executive summary, briefly describes the archi-
- tecture of the Storage Manager and gives an overview of the
- facilities provided to applications,
-
- Version 3.1 of the Storage Manager runs on the following archi-
- tectures: Sun 4 (Sparc) (under SunOS 4.1.[23]), DecStation
- 3100/5000 (MIPS) (under Ultrix 4.2), and HP 720 (under HP-UX
- A.08.07). The Storage Manager is written in C++ and had been
- checked for compilation under the GNU C++ compiler (g++), version
- 2.3.3 and 2.4.5.
-
- 2.1. Architecture
-
- The EXODUS Storage Manager has a client-server architecture. An
- application program that uses the Storage Manager may reside on a
- machine different from the machine or machines on which the
- Storage Manager server or servers run. We use the term applica-
- tion to refer to programs that use the Storage Manager through
- the client programming interface described in Section 4. We use
- the term client library, or client, to refer to the Storage
- Manager code and data structures that are linked into the appli-
- cation program to support the client programming interface. The
- client allows applications to use the facilities described in the
- next sub-section. Each client has its own buffer pool for caching
- data. The client library connects to one or more server processes
- and communicates with them using a remote-procedure-call-style
- mechanism that runs over TCP.
-
- The Storage Manager server is a multi-threaded process providing
- asynchronous I/O, file, transaction, concurrency control, and
- recovery services to multiple clients. The server stores all data
- on volumes, which are either Unix files or raw disk partitions.
- The server is more completely described in Section 5 and in the
- EXODUS Storage Manager Architecture Overview [exoArch].
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 2.2. Facilities
-
- The EXODUS Storage Manager provides objects for storing data,
- versions of objects, files for grouping related objects, and
- indexes for supporting efficient object access. The Storage
- Manager also provides volumes, transactions, concurrency control,
- recovery, and configuration options. These facilities are
- presented briefly in this section, and more information can be
- found in later sections of the document.
-
- 2.2.1. Objects
-
- An object is an uninterpreted container of bytes, which can range
- in size from a few bytes to a little less than the size of a
- disk. Internally, the Storage Manager distinguishes two types of
- objects. There are small objects, which are objects that fit on
- a single disk page, and large objects, which are objects that do
- not fit on a single disk page. Support is also provided for
- creating and manipulating versions of both small and large
- objects. To provide a uniform function call interface, the dis-
- tinction between small, large, and versioned objects is hidden
- from applications. Applications are unaware of whether they are
- dealing with a small or large object, and the same interface
- functions are called to manipulate either type of object. To
- simplify the task of manipulating very large objects, the Storage
- Manager provides flexible buffer management that allows
- variable-length pieces of large objects to be buffered contigu-
- ously in the client buffer pool.
-
- Objects have object identifiers. The object identifier of a
- small object points directly to the object on disk, while the
- object identifier of a large object points to a large object
- header. The header of a large object serves as the root of a
- B[+]tree index structure that is used to access the object's data
- [Care86, Care89]. For space efficiency, a large object header can
- share a disk page with small objects and other large object
- headers. The data pages and the pages that make up the index
- structure of a large object are not shared, however. When a
- small object grows to the point where it can no longer be stored
- on a single page, the Storage Manager automatically converts it
- to a large object, leaving the new header in place of the origi-
- nal object.
-
- The Storage Manager provides functions to read, overwrite,
- insert, delete, and append to an object. Read requests specify an
- object identifier and a range of bytes. The desired data is read
- into a contiguous region in the client buffer pool (even if is
- distributed over several disk pages), and a pointer to the data
- is returned to the caller. The overwrite function uses the
- pointer set up by a read request, and overwrites a subrange of
- the data. The insert and delete functions allow data to be
- inserted into and deleted from objects at arbitrary offsets,
-
-
-
- 2
-
-
-
-
-
-
-
-
-
-
- while the append function allows data to be appended to the end
- of an object. As mentioned earlier, large objects are represented
- using a B[+]tree index structure. This ensures that each of the
- above operations can be executed efficiently on large objects.
-
- 2.2.2. Versions
-
- A version of an object is another object that appears to be a
- copy of the original object. A version of a small object is a
- copy of the original object. A version of a large object is an
- object header with a pointer into the original object's data,
- until either the version or the original object is updated. When
- the large object version is updated, the affected portions of the
- original object are copied to prevent the original object from
- being affected by the update [Care89]. Although the version sup-
- port described here is primitive, essentially providing "copy-
- on-write" objects, it has been purposefully designed that way so
- that a variety of application-specific versioning schemes can be
- implemented on top of the Storage Manager.
-
- 2.2.3. Files
-
- Objects are allocated in files, which are collections of related
- objects. Files have three uses.
-
- First, files are used for clustering objects. The objects in a
- file are stored on disk pages allocated solely to that file, so
- files provide a way to physically co-locate related objects on
- the disk.
-
- Second, the Storage Manager provides an efficient way to scan the
- objects in a file, visiting each object exactly once.
-
- Third, the Storage Manager offers an efficient mechanism for
- loading the objects into a file in bulk.
-
- 2.2.4. Indexes
-
- The Storage Manager provides B[+]tree indexes and linear hashing
- indexes. Index keys can be any basic C language data type or
- strings. Values can be any type of fixed length.
-
- 2.2.5. Volumes
-
- User data and Storage Manager meta-data (objects, files, indexes,
- and logs) are stored on volumes. A volume represents a disk,
- although in fact it may be a Unix raw disk partition or a Unix
- file.
-
- Volumes can be temporary, which means that data stored on them
- are not logged, and they do not persist from one transaction to
- the next. Temporary volumes are meant to provide fast storage
-
-
-
- 3
-
-
-
-
-
-
-
-
-
-
- for temporary data.
-
- 2.2.6. Transactions
-
- A transaction is a set of operations on objects, files, and
- indexes. Transactions are either committed or aborted. Updates
- made by committed transactions are guaranteed to be reflected on
- stable storage, even in the event of software or processor
- failure. Updates made by aborted transactions are not reflected
- on stable storage.
-
- Transactions that use data on more than one server are committed
- using a distributed two-phase commit protocol [Moha83].
-
- 2.2.7. Concurrency Control
-
- Concurrency control allow multiple client applications safely to
- use data simultaneously. Concurrency control is based on the
- standard hierarchical two-phase locking protocol providing
- degree-three consistency (see [Gray78, Gray88]). The lock hierar-
- chy contains two granularities: file-level, and page-level. Lock-
- ing for index operations is performed with a non-two-phase proto-
- col, which allows multiple clients to read and update the same
- index.
-
- Deadlocks involving more than one server are resolved through
- timeouts.
-
- 2.2.8. Recovery
-
- The Storage Manager recovers from software, operating system, and
- CPU failure by restoring data to a state in which all transac-
- tions have been committed or aborted. After an application
- fails, the transaction it is running is aborted by the servers
- that cooperated in the transaction. After a server fails and is
- restarted, updates made by committed transactions are restored,
- and updates by transactions in progress at the time of failure
- are undone. Recovery from media (disk) failure is not supported.
-
- 2.2.9. Configuration Options
-
- The Storage Manager client library and servers have configuration
- options, which can be set by users. These options control such
- things as parameters that affect performance and memory use, for-
- mats of volumes and logs, the choice of servers to be contacted
- by clients, and path names of installed executable files.
-
- 2.3. Illustration of Using the Storage Manager
-
- The purpose of this section is to give the reader a context in
- which to read the rest of this document. This section illus-
- trates a way to get started using the Storage Manager. There are
-
-
-
- 4
-
-
-
-
-
-
-
-
-
-
- many ways to install, configure, and use the Storage Manager;
- only the simplest way is illustrated here.
-
- This section uses an example application, "producer-consumer".
- The source code for the application programs is included in the
- Storage Manager software release, along with other example appli-
- cations.
-
- The producer program generates a series of transactions, each of
- which creates an object. The consumer program generates a series
- of transactions, each of which reads an object and destroys it.
- These programs were selected because they are relatively small,
- demonstrate the use of transactions, and show how to respond to
- server-initiated transaction failures and server failures.
-
- The remainder of this section gives specific directions for
- starting a server and running the example program. Detailed
- explanations of the steps are not given here; all the details are
- given elsewhere in this document.
-
- Installing the storage manager is akin to installing an operating
- system or a remote file system (but it's much simpler). You need
- to:
-
- (1) install the system's executable code, libraries, and
- include files;
-
- (2) prepare your disks for use;
-
- (3) configure your server so that it will use your disks, and
- so that it is otherwise tailored for your use;
-
- (4) compile and link your application programs to use the
- installed system;
-
- (5) configure your application programs' environment, run the
- programs, and
-
- (6) when you are finished, shut the system down.
-
- 2.3.1. Files Needed
-
- The following files are needed to use the Storage Manager:
-
- (1) libsm_client.a, the Storage Manager client library,
-
- (2) sm_client.h, the include file containing declarations of
- key data structures and constants,
-
- (3) sm_server, the executable file for the server portion of
- the Storage Manager,
-
-
-
-
- 5
-
-
-
-
-
-
-
-
-
-
- (4) diskrw, the executable file for the disk I/O processes
- used by the server process,
-
- (5) formatvol, and a utility program for formatting volumes,
-
- (6) .sm_config, configuration files for a server, the for-
- matter, and the application programs. One configuration
- file can be used for all programs, but it is sometimes
- easier to use configuration file for servers and the for-
- matter, and another for applications.
-
- These files can be installed anywhere; for the purpose of this
- section, we assume that they are all installed in your home
- directory, along with your application programs. (See the EXODUS
- Storage Manager Installation Manual to find the files in the
- Storage Manager software release.)
-
- 2.3.2. Preparing Your Disks
-
- The producer and consumer programs use a volume for storing their
- objects with a single server, and the server uses a log volume.
- The formatvol program is used to format a volume for use as
- either a data volume or a log volume. If you plan to use a raw
- disk partition for either volume, ask your system administrator
- for information on how to set up the device.
-
- The formats of the volumes must be described in the configuration
- file that formatvol reads. In the directory in which you plan to
- run formatvol, create a file called .sm_config that looks some-
- thing like this, with the appropriate substitutions:
-
-
- formatvol*logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
- formatvol*dataformat: /path/to/datafile: 8000: 1: 1: 300
-
-
-
- Substitute the pathnames for files that you want to use for your
- log volume and data volume. With the options given above, the
- log volume will be given a volume identifier of 9000, and will
- consist of 1 cylinder of 1 track each, with 1000 blocks on each
- track, hence, 1000 blocks will be on the log. The log volume
- will use 8 Kbyte log pages. The data volume will be given a
- volume identifier of 8000, and will consist of 1 cylinder of 1
- track each, with 300 blocks on each track, hence, 300 blocks will
- be on the data volume.
-
- Now, run the formatter on volumes 9000 and 8000:
-
- formatvol -vol 9000 -vol 8000
-
-
-
-
-
- 6
-
-
-
-
-
-
-
-
-
-
- If you would like to see the information written on the volumes'
- headers, do this:
-
- formatvol -dis 9000 -dis 8000
-
-
- The formatter prints:
-
- VOLID 9000, version 3, is a LOG volume
- BLOCK SIZES: 8 K slotted, 8 K lg data, 8 K lg hdr
- 8 K btree, 8 K idesc
- LAYOUT: 1000 blk/trk; 1 trk/cyl; 1 cyl
- 1000 total blocks of 8 KB for 8192.000 KB
- FREE: 0 free, 1000 used
- BITMAP: 1 blk each, freemap @ 2, slotmap @ 4, filemap @ 5
- UNIQUE: start @ 3
- LOG: start @ 7, ctl blk @ 6, blk sz 8 K, #blks 993
- end of log @ dismount: LSN w=0.o=0, LRC w=0.c=1
- VOLID 8000, version 3, is a DATA volume
- BLOCK SIZES: 8 K slotted, 8 K lg data, 8 K lg hdr
- 8 K btree, 8 K idesc
- LAYOUT: 300 blk/trk; 1 trk/cyl; 1 cyl
- 300 total blocks of 8 KB for 2457.600 KB
- FREE: 294 free, 6 used
- BITMAP: 1 blk each, freemap @ 2, slotmap @ 4, filemap @ 5
- UNIQUE: start @ 3
-
-
-
- Now that you have formatted a log volume and a data volume, you
- are ready to start a server.
-
- 2.3.3. Configuring a Server
-
- Before you start a server, you need to create its configuration
- file. In the directory in which you expect to run the server,
- create a file called .sm_config that looks something like this,
- with the appropriate substitutions (in particular, for each
- occurrence of /path/to below):
-
-
- server*bufpages: 500
- # Portname need not be identical to log volume id.
- # This is just a convenience.
- server*portname: 9000
- server*diskproc: /path/to/diskrw
- server*logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
- server*dataformat: /path/to/datafile: 8000: 1: 1: 500
- server*logvolume: 9000
-
-
-
-
-
-
- 7
-
-
-
-
-
-
-
-
-
-
- If the same configuration file is to be used for the formatter
- and the server, the format options can be made to be recognized
- by both:
-
-
- [sf]*[rl].logformat: /path/to/logfile: 9000: 1: 1: 1000: 8
- [sf]*[rl].dataformat:/path/to/datafile: 8000: 1: 1: 500
-
-
-
- Now you can start the server. Open a window in which to run the
- server, and, in the directory containing the server and its con-
- figuration file, start the server:
-
- sm_server
-
- The server is started on a newly formatted log volume, so it
- automatically regenerates the log. The server prints
-
- Server is ready for requests.
-
- when it can serve applications.
-
- 2.3.4. Compiling and Linking Your Application
-
- An application program must include the header file sm_client.h,
- which, in turn includes <stdio.h>, <setjmp.h>, <sys/types.h>,
- <netinet/in.h>. Applications can be compiled with a C or C++
- compiler.
-
- The client library is compiled with C++, so client programs must
- be linked with a C++ compiler. See the EXODUS Storage Manager
- Installation Manual for more information.
-
- 2.3.5. Configuring and Running Your Application
-
- The programs need configuration options to determine where to
- find the server that manages the data volumes they use, and to
- determine the sizes of the buffer pools they will use. In the
- directory in which you expect to run the application programs,
- Create a file called .sm_config that looks something like this,
- with the appropriate substitutions:
-
-
- # both producer and consumer will use
- # 250 page buffer pools:
- client*bufpages: 250
- # substitute the name or Internet address
- # of the host on which the server runs:
- client*mount: 8000 9000@serverhost
-
-
-
-
-
- 8
-
-
-
-
-
-
-
-
-
-
- Now you can run the producer and the consumer. It is easiest to
- create a window in which to run each program. The produce and
- consumer programs use the environment variable EVOLID to deter-
- mine the what volume to use. EVOLID must be set in each window.
-
- In window P:
-
- # producer <name> <#objects> <object size>
- setenv EVOLID 8000
- producer P 100 1000
-
- In window C:
-
-
- # consumer <name> <#objects>
- setenv EVOLID 8000
- consumer C 100
-
-
- The producer creates "#objects" objects and writes "name" in each
- one. The "object size" argument is the size of each object. The
- consumer reads and destroys "#objects" objects. It prints the
- sizes of the objects and their names. The "name" given to the
- consumer program is immaterial, but is helpful for reading the
- output when running more than one consumer.
-
- The two programs use a single root entry and a single file on the
- given volume. When a consumer has consumed the last object in a
- file, it destroys the file and removes the root entry. Each
- object is produced or consumed in a separate transaction. When
- both a producer and consumer are running concurrently, deadlocks
- occur periodically, since both are reading and writing the same
- file. When a deadlock occurs, the offending program aborts its
- transaction and tries again. Multiple producer and consumer pro-
- grams may be started. If the server fails or shuts down, the pro-
- ducer and consumer programs attempt to reconnect every five
- seconds, and when successful, they continue transaction process-
- ing.
-
- 2.3.6. Shutting Down the Server
-
- In the window in which the server runs, type the command:
-
- shutdown
-
-
- The server prints various messages, among them
-
-
- Clean shutdown: no recovery required on any volumes.
- All disk processes killed.
-
-
-
-
- 9
-
-
-
-
-
-
-
-
-
-
- when recovery is not required.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 10
-
-
-
-
-
-
-
-
-
-
- 3. CONFIGURATION OPTIONS AND CONFIGURATION FILES
-
- The client library, servers, and administrative programs use con-
- figuration options. All the options have a string name, a type,
- a set of possible values, a default value, and a current value.
- Client options can be set by a call to an application interface
- function or by a line in a configuration file. Server options can
- be set on the command line or by a line in the server's confi-
- guration file.
-
- Configuration files are Unix files, and are similar in format to
- the X Window system's resource files. Each line in a configura-
- tion file is an option command or a comment.
-
- A comment is a line that begins with "#" or with "!".
-
- An option command is a line containing an option descriptor,
- white space, and a string representing a value to assign to the
- option. An option descriptor consists of an option prefix fol-
- lowed immediately by an option name and a ":".
-
- The option prefix specifies the type and name of the program or
- programs for which the option is to be set. The program type is
- one of "client", "server", and "formatvol". The program name is
- usually the file name of the program, without its path (an appli-
- cation program can override this). The program type and program
- name are separated by ".". For example, the complete option
- descriptor for the option "bufpages" on the server named serverA
- is server.serverA.bufpages:.
-
- Wild card characters are allowed in the program type and name.
- The character "*" represents any portion of the prefix. The "?"
- character represents any program type or any program name. The
- expressions describing the program type and the program name are
- parsed by a regular expression handler, so complex expressions
- can be used. See the manual page for regex(3).
-
- The names of options can be abbreviated, as long as the abbrevia-
- tion unambiguously identifies a single option. (This is also true
- for options appearing on command lines.) Program types and names
- may not be abbreviated. Option name, program type, and program
- name matches are case-sensitive.
-
- Configuration options of type Boolean can be set with the Boolean
- values TRUE or FALSE, or with the strings "yes", "true", "no" or
- "false". The strings may be abbreviated and are not case-
- sensitive.
-
- Each setting of an option overrides any previous value for that
- option.
-
-
-
-
-
- 11
-
-
-
-
-
-
-
-
-
-
- Below, excerpts from configuration files illustrate ways to use
- the options.
-
-
- # log volumes for two servers, whose executable
- # file names are serverA and serverB
- server.serverA.logvolume: 1000
- server.serverB.logvolume: 2000
-
-
-
-
- # turn off progress printing for all servers
- server*progress: no
- # or
- server.?.progress: no
-
-
-
-
- ! all servers and clients have a 1000 page buffer pool
- *bufpages: 1000
- # The application foo uses a 500 page buffer pool.
- # (overriding the value of 1000, above)
- client.foo.bufpages: 500
- # Applications beginning with the letter g use 400 pages
- client.g*.bufpages: 400
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 12
-
-
-
-
-
-
-
-
-
-
- 4. THE STORAGE MANAGER APPLICATION INTERFACE
-
- The Storage Manager's application interface consists of a set of
- functions, macros, and variables. The Storage Manager software
- release contains the header file sm_client.h, in which are found
- the definitions for the macros and types that appear in this
- document. Function prototypes for the the Storage Manager func-
- tions are also found in sm_client.h. By convention, words that
- appear capitalized in the text are either C-preprocessor macros,
- or C- or C++- defined types, Functions definitions appear in bold
- face in the text. The rest of this section is divided into sub-
- sections describing error handling, initialization and shutdown,
- transactions, buffer management, operations on objects, opera-
- tions on versions, operations on files, operations on indexes,
- miscellaneous macros, and administrative functions.
-
- 4.1. Handling Errors
-
- Error handling is important to users wishing to write robust
- client applications. We discuss it first, since most Storage
- Manager functions return error codes. Although this issue is com-
- plex, some of the burden is lightened by the recovery facilities
- of the Storage Manager. In this section we focus on error codes
- and error messages.
-
- Almost all Storage Manager functions have integer return codes.
- All functions (except those used in printing error messages)
- return either esmNOERROR (zero), which represents success, or
- esmFAILURE (negative one), which represents an error. When an
- error occurs, the global variable sm_errno contains an error
- code. A small positive error code is an error code returned by
- Unix, as defined in <errno.h>. An error code greater than 65,536
- is an error returned by the Storage Manager, as defined in
- sm_client.h. The Storage Manager error codes have symbolic names
- (C preprocessor macros) that begin with esm. The value of
- sm_errno is not defined when the function returns esmNOERROR.
-
- Information about error codes can be obtained from the functions
- sm_Error( ), and sm_ErrorId( ), which are discussed below.
-
- Some errors cause a message to be printed to the file addressed
- by sm_ErrorStream. By default, this file is the standard error
- file, stderr, as defined in <stdio.h>, but the application can
- change it any time after the Storage Manager is initialized.
-
- Errors differ in severity and have different side effects. The
- most severe errors are fatal and cause the application to exit
- (the client library calls exit(3)). When the application exits,
- the servers abort the transaction, if a transaction is active.
- Fatal errors are caused by internal software problems in the
- Storage Manager. An example of a fatal error is esmMALLOCFAILED,
- which occurs when the entire data segment has been allocated by
-
-
-
- 13
-
-
-
-
-
-
-
-
-
-
- the application and client library, and the Storage Manager can-
- not proceed.
-
- Less severe errors cause the transaction to be aborted, but leave
- the application running. When this happens, sm_errno is given
- the value esmTRANSABORTED, and the client library also sets the
- global variable sm_reason. The range of values for sm_reason is
- the same as the range of values for sm_errno. (The value of
- sm_reason is meaningful only if sm_errno has the value esmTRANSA-
- BORTED, and it is unpredictable and meaningless otherwise.) When
- the server or the client library aborts a transaction and returns
- esmTRANSABORTED to the application, the transaction is only par-
- tially aborted. The application must complete the termination of
- the transaction by calling sm_AbortTransaction( ) (described in
- the Section 4.3.3, Transaction Operations).
-
- Less severe errors are generated by incorrect arguments to client
- interface functions or the lack of resources, such as buffer
- space. The application can correct the problem and retry the
- operation without aborting the transaction.
-
- Finally, some error codes indicate conditions that are not errors
- at all, such as esmEMPTYFILE, which is returned when an empty
- file is read.
-
- The following two functions can be used to print more information
- about the error.
-
-
- char *sm_Error (errorCode)
- int errorCode; /* error code returned by an sm function /*
-
-
- char *sm_ErrorId (errorCode)
- int errorCode; /* error code returned by an sm function /*
-
-
- These are the only Storage Manager functions that do not return
- an integer. When a client library function returns an error,
- sm_Error( ) can be called by the application to get a string that
- provides a brief description of the error. It also provides
- descriptions of Unix error codes. Sm_ErrorId( ) is used to
- return the string representation of the error code. For example,
- the call sm_ErrorId(esmBADOID) returns the string "esmBADOID",
- and the call sm_Error(esmBADOID) returns the string "invalid
- object id."
-
- If the client is disconnected from a server (by a server crash,
- network failure, etc.) the client library tries to reconnect to
- the server the next time it issues a request to the server. If
- the server in question is not available, the Storage Manager
- returns an error such as esmSERVERDIED or a Unix error such as
-
-
-
- 14
-
-
-
-
-
-
-
-
-
-
- ECONNREFUSED. While the server in question is doing recovery
- after a restart, esmTRANSDISABLED is returned. The server
- responds to requests when recovery is completed.
-
- 4.2. Initialization and Shutdown Operations
-
- Initialization and shutdown functions are used at the beginning
- and end of an application program, but most of them can be called
- at any time. The pertinent functions are sm_SetClientOption( ),
- sm_GetClientOption( ), sm_ParseCommandLine( ),
- sm_ReadConfigFile( ), sm_Initialize( ), and sm_ShutDown( ).
-
- Before initializing the Storage Manager client with
- sm_Initialize( ), a number of client configuration options must
- be set by the application. Options can be set through calls to
- sm_SetClientOption( ), sm_ParseCommandLine( ), or
- sm_ReadConfigFile( ). These options are summarized in Table 1.
- See Section 3 for information that applies to all options.
-
-
- ____________________________________________________________________________________
- Option Option Possible Default Option
- Name Type Values Values Description
- ____________________________________________________________________________________
- bufpages int > 4 none # pages in the buffer pool
- groups int > 3 20 # buffer groups
- userdesc int > 0 2000 # user descriptors
- mount string volid port@host none where to find server
- for this volume
- lognewpages Boolean yes,no,true,false no/false client logs new pages
- deallocpages Boolean yes,no,true,false yes/true removes empty pages
- pagelock string SH,EX SH default lock for pages
- traceflags int >= 0 0 set tracing flags
- locktimeout int >= 0 30 # 10-second intervals
- willing to await a lock
- ____________________________________________________________________________________
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Table 1: Client Options
-
-
- The "bufpages" option sets the size of the client buffer pool in
- 8 Kbyte pages (or n byte pages, for n=MIN_PAGESIZE; MIN_PAGESIZE
- is defined in sm_client.h). See Section 4.11.3, Tuning the
- Application for more information about setting this option.
-
- The "groups" option sets the limit on the number of buffer groups
- that can be opened at once. The default value is 20. See Section
- 4.6, Buffer Operations, for more information about buffer groups.
-
-
-
-
-
-
- 15
-
-
-
-
-
-
-
-
-
-
- The "userdescs" option sets the limit on the number of open user
- descriptors. The number of user descriptors should be set to the
- maximum number of simultaneous object references that are
- expected by the application program. The default value is 2000.
- See Section 4.7, Operations on Objects, for more information
- about user descriptors.
-
- The "lognewpages" option, if "yes", causes the client to generate
- log pages for newly allocated pages, and if "no", causes the
- server to generate the log pages. Setting this option to "no"
- results in fewer log records shipped to servers and usually
- lowers log space requirements for transactions that create
- objects. With rare patterns of use, setting "lognewpages" to
- "yes" results in better performance: if the objects that cause
- new pages to be allocated are small, and if enough work is done
- between object-creation operations to cause the newly allocated
- pages to be swapped, the preferred value for "lognewpages" is
- "yes". In general, it is difficult to predict which objects will
- be be created on newly allocated pages. The "lognewpages" option
- may be set only when a transaction is not active.
-
- The "deallocpages" option, if "yes", causes the client to deallo-
- cate pages that become empty after objects are destroyed. If the
- option's value is "no", these pages remain in the file, and do
- not get used again unless an appropriate near-hint is given when
- an object is subsequently created. Under most circumstances, the
- preferred value of "deallocpages" is "yes". If objects are
- created and destroyed in a LIFO fashion, and if the near-hint for
- object creation is NEAR_LAST, the preferred value is "no".
-
- The "pagelock" option changes the default lock mode for pages.
- See the Section 4.2, Initialization and Shutdown Operations, and
- Appendix A, Locking Protocol for Storage Manager Operations for
- information about using options.
-
- The "traceflags" option is used to turn on tracing, and is only
- available in a Storage Manager that was compiled with -DDEBUG.
- The "traceflags" option takes effect immediately and can be set
- at any time.
-
- The "mount" options indicate the locations of the volumes that
- the applications use. The "mount" option may be used more than
- once, to add new volumes to the client library's set of usable
- volumes, or to change the location of a volume. The option value
- consists of a volume's integer identifier, an Internet address,
- and a port at which can be found a server that manages the
- volume. The Internet addresses and port have format port @ host,
- where both the port and the host can be numeric or symbolic.
- Symbolic port names must be found in the services database used
- by getservbyname(3n), and symbolic host names must be in the host
- name database used by gethostbyname(3n). The following example
- shows three values for the "mount" option that accomplish the
-
-
-
- 16
-
-
-
-
-
-
-
-
-
-
- same thing in three ways. The volume 1000 is managed by the
- server listening on port 1152 (which is called "bounty" in the
- /etc/services database) on the local machine, whose Internet
- address is 128.105.2.153, also known as "pitcairn.isle.edu" to
- the host-name server.
-
- 1000 1152@128.105.2.153
- 1000 bounty@pitcairn.isle.edu
- 1000 1152@pitcairn.isle.edu.
- and
- 1000 bounty@128.105.2.153
-
-
- The host name localhost does not work if you are using distri-
- buted transactions (multiple cooperating servers).
-
- Volume identifiers must identify volumes unambiguously, across
- all servers.
-
- For each application or client, all the host names used for a
- given server must resolve to the same Internet address. Using
- the above example, this means that "128.105.2.153" and
- "pitcairn.isle.edu" are interchangeable. "Localhost", which
- resolves to the Internet address 127.0.0.1, is not interchange-
- able with "128.105.2.153" or "pitcairn.isle.edu", even though it
- addresses the same machine when used by a client on
- "pitcairn.isle.edu".
-
- It is acceptable to use two different servers running on a
- machine, by addressing them at different ports. This means that
-
- 1000 1151@pitcairn.isle.edu
- and
- 2000 1152@pitcairn.isle.edu
-
- can serve an application.
-
- The "locktimeout" option limits the time the server waits to
- acquire a lock on behalf of the client. The value represents a
- number of 10-second intervals. A value of zero means that the
- server does not wait at all, and if the lock cannot be acquired
- immediately, the client operation returns esmFAILURE, with
- esmLOCKBUSY in sm_errno. The option value can be changed at any
- time. The value that is in effect at the time a transaction
- makes its first request to a server is the value used for lock
- requests on that server for the duration of the transaction. See
- Appendix A, Section A.3, Deadlock Detection and Avoidance, for
- more information about locks. See also Section 4.4, Mounting and
- Dismounting Volumes, for information concerning the protocol
- between clients and servers.
-
-
-
-
-
- 17
-
-
-
-
-
-
-
-
-
-
- To support code that was written before the configuration option
- facility was added, the client library looks for the environment
- variable ESMCONFIG. If set, ESMCONFIG indicates a configuration
- file to read. The file is read using sm_ReadConfigFile( ), with
- its "programName" argument having the value NULL. It is read
- before any option is set, so all other functions that set options
- override those found in the ESMCONFIG file.
-
-
- sm_SetClientOption (optionName, optionValue, valueType)
- char *optionName; /* IN name of the option to set */
- void *optionValue; /* IN new value for the option */
- SMDATATYPE valueType; /* IN type of optionValue */
-
- Sm_SetClientOption( ) sets the option named "optionName" to the
- value in "optionValue". The "valueType" arguments indicates the
- type addressed by "optionValue". The supported types are SM_int,
- SM_Boolean, and SM_string. If "valueType" matches the type of the
- option as specified in Table 1, a simple assignment is done. If
- "valueType" is SM_string and the option has a different type, a
- conversion is performed.
-
-
- sm_GetClientOption (optionName, optionValue)
- char *optionName; /* IN name of the option to get */
- void *optionValue; /* OUT value for the option */
-
- Sm_GetClientOption( ) retrieves the value for "optionName" and
- returns it in "optionValue". It is assumed that the location
- addressed by "optionValue" matches the type, found in Table 1,
- for the option. For string-type options, the argument "option-
- Value" is treated as type "const char **pq. That is, it should
- contain the address of a pointer variable that is updated to
- point to a read-only buffer containing the option value.
-
-
- sm_ParseCommandLine (argc, argv, errorMsg)
- int *argc; /* IN/OUT number of command line arguments */
- char **argv; /* IN/OUT command line arguments */
- char **errorMsg; /* OUT syntax error message */
-
- Sm_ParseCommandLine( ) searches the command line, "argv", for any
- client options. Command-line options are prefixed by a "-". The
- value for the option must follow the option name. The Storage
- Manager ignores any command-line argument that is not recognized
- as a Storage Manager client option. If a client option is found,
- the name and value are removed from "argv" and "argc" is decre-
- mented by 2, even if there is an error in the option such as
- being given an illegal value. If there is an error processing
- any option, "errorMsg" is changed to point to an error message
- string.
-
-
-
-
- 18
-
-
-
-
-
-
-
-
-
-
- sm_ReadConfigFile (configFile, programName, errorMsg)
- char *configFile; /* IN name of the configuration file */
- char *programName; /* IN name of the application */
- char **errorMsg; /* OUT syntax error message */
-
-
- Sm_ReadConfigFile( ) reads the option configuration file "config-
- File", and sets the options indicated. If "configFile" is NULL,
- the default configuration files /usr/lib/exodus/sm_config,
- $HOME/.sm_config, and ./.sm_config are read in that order, if
- they exist. The name of the default configuration file
- /usr/lib/exodus/sm_config can be changed with a minor Storage
- Manager source code change described in the installation manual,
- EXODUS Storage Manager Installation Manual. The "programName"
- option gives the program name for matching with options in the
- configuration file. If "programName" is NULL and a previous call
- to sm_ReadConfigFile( ) had a non-NULL "programName", the previ-
- ous "programName" is used. If no previous call was made and a
- "programName" is not given, configuration file lines that contain
- a program name are not used; only generic entries, such as
- client.bufpages: 1000 and client*bufpages: 1000 are used.
-
- When an error occurs while reading the file, an error message is
- stored in "errorMsg" and esmFAILURE is returned, as with other
- Storage Manager functions. The "errorMsg" is describes syntax
- related errors in the configuration file.
-
- See Section 3 for information about the format of configuration
- files.
-
-
- sm_Initialize ( )
-
- Sm_Initialize( ) initializes the Storage Manager's data struc-
- tures. No Storage Manager functions except option and configura-
- tion file functions may be called before sm_Initialize( ) is
- called. Options that do not have defaults must be set before
- sm_Initialize( ) is called, otherwise esmFAILURE is returned,
- sm_errno is set to indicate what the problem is.
-
-
- sm_ShutDown ( )
-
- Sm_ShutDown( ) closes all the open buffer groups and frees the
- memory allocated at run-time by the client library. Once the
- client library has been shut down, it can used again by calling
- sm_Initialize( ). The client library loses track the information
- in the "mount" client options, so if sm_Initialize( ) is to be
- used again, the configuration files must be reread or the mount
- options must be reset with sm_SetClientOption( ).
-
-
-
-
-
- 19
-
-
-
-
-
-
-
-
-
-
- Figure 2 shows a simple "hello world" application for the Storage
- Manager. It sets configuration options, initializes the client
- library, and shuts down the client library. A more complete pro-
- gram would, begin transactions, perform operations on objects,
- files, and indexes. More sample programs are included with the
- software release.
-
-
-
-
-
- /*
- * "Hello world" program: demonstrates initialization and shutdown.
- */
- #include <stdlib.h>
- #include "sm_client.h"
-
- void ErrorCheck (int, char *);
-
- main(int argc, char** argv) {
- int e;
- char *errorMsg;
-
- e = sm_ReadConfigFile(NULL, argv[0], &errorMsg);
- if (e != esmNOERROR) {
- fprintf(stderr, "Configuration file error: %s", errorMsg);
- ErrorCheck(e, "sm_ReadConfigFile");
- exit(0);
- }
- e = sm_ParseCommandLine(&argc, argv, &errorMsg);
- if (e != esmNOERROR) {
- fprintf(stderr, "Command line error: %s", errorMsg);
- ErrorCheck(e, "sm_ParseCommandLine");
- exit(0);
- }
-
- e = sm_Initialize( ); ErrorCheck(e, "sm_Initialize");
- printf("Hello world!");
- e = sm_ShutDown( ); ErrorCheck(e, "sm_ShutDown");
- }
-
- void ErrorCheck (int e, char *func) {
- if (e < 0) {
- fprintf(stderr, "Storage Manager error \"%s\" in %s",
- sm_Error(sm_errno), func);
- exit(1);
- }
- }
-
- Figure 2: Example Program
-
-
-
-
-
- 20
-
-
-
-
-
-
-
-
-
-
- 4.3. Transactions
-
- The Storage Manager supports transactions, including concurrency
- control and recovery. Transactions may involve data managed by
- several Exodus Storage Manager servers, in which case a two-phase
- commit protocol, based on Presumed Abort [Moha83], determines the
- fate of the transaction when the application commits the transac-
- tion. The fact that such a transaction is distributed over
- several servers is invisible to the application. On the other
- hand, the Storage Manager (server or servers) can cooperate in a
- two-phase commit procedure with other transaction processing sys-
- tems when the external two-phase commit functions are used. The
- external two-phase commit functions also can be used explicitly
- to invoke the two phases for a transaction that involves only
- Exodus Storage manager servers. The external two-phase commit
- functions are described under "Advanced Topics", in Section
- 4.11.3, External Two-Phase Commit Functions,
-
- Object, file, index, and root entry operations must be performed
- within the scope of a transaction, or an error is returned. An
- application can run no more than one transaction at a time.
- Transactions cannot be nested, suspended, or resumed.
-
- In order to guarantee the semantics of transactions, operations
- on objects and files acquire locks. Appendix A describes the
- kinds of locks acquired by the client library functions.
-
- 4.3.1. Transaction Identifiers
-
- Each transaction has a local transaction identifier, which is
- assigned by the Storage Manager. The data type TID represents a
- transaction identifier. The application can treat a TID as an
- opaque value. The Storage Manager maintains a global variable,
- Tid, of type TID, which value the application can inspect, but
- had better not modify.
-
- The application can use the following two macros to give an ini-
- tial value to a transaction identifier, and to recognize that
- value.
-
- INVALIDATE_TID (TID tid)
-
-
- sets the "tid" argument to an invalid transaction identifier.
-
- TID_IS_INVALID (TID tid)
-
-
- returns TRUE if "tid" is the value given by INVALIDATE_TID( ),
- FALSE if not. TID_IS_INVALID( ) does not tell if there is an
- active transaction with the given transaction identifier.
-
-
-
-
- 21
-
-
-
-
-
-
-
-
-
-
- 4.3.2. Transaction States
-
- An application is always in one the following states: not running
- a transaction (INACTIVE), running a transaction (ACTIVE), running
- a transaction that has been (partially) aborted (ABORTED).
-
- An application is in the INACTIVE state until it calls
- sm_BeginTransaction( ), and after a call to
- sm_CommitTransaction( ) or sm_AbortTransaction( ).
-
- If the Storage Manager server or client library aborts a transac-
- tion, which sometimes happens because of an error on the part of
- the application, the application is in the ABORTED state until a
- call to sm_AbortTransaction( ). While in the ABORTED state, a
- call to any function other than sm_AbortTransaction( ) returns
- the error esmTRANSABORTED.
-
- 4.3.3. Transaction Operations
-
-
- sm_BeginTransaction (tid)
- TID *tid; /* OUT transaction ID */
-
- Sm_BeginTransaction( ) is called at the beginning of a transac-
- tion. The argument "tid" corresponds to a transaction identifier
- and is assigned by the Storage Manager.
-
- Sm_BeginTransaction( ) does not contact any servers or initiate a
- transaction with any server, since the operation has no arguments
- to indicate which servers are of interest. It only begins a
- transaction "locally". Once a transaction has begun locally, the
- client library initiates transactions on servers when data refer-
- ences so require.
-
-
- sm_CommitTransaction (tid)
- TID tid; /* IN transaction ID */
-
- Sm_CommitTransaction( ) is called to commit the effects of a
- transaction. If the commit succeeds, all changes made to data
- since the beginning of the transaction are guaranteed to be per-
- sistent, even in the event of system failure. See Section 4.9.1,
- Consistency Guarantees for Files, for more information about this
- guarantee. If the commit fails, an error is returned, and the
- transaction is aborted. When a transaction is committed, all user
- descriptors (see sm_ReadObject( ) ) are released. Buffer groups
- attached to the transaction (see sm_OpenBufferGroup( ) ) are
- closed.
-
-
- sm_AbortTransaction (tid)
- TID tid; /* IN transaction ID */
-
-
-
- 22
-
-
-
-
-
-
-
-
-
-
- Sm_AbortTransaction( ) aborts a transaction.
- Sm_AbortTransaction( ) releases all the user descriptors that
- were created during the transaction (see sm_ReadObject( ) ).
- Buffer groups attached to the transaction (see
- sm_OpenBufferGroup( ) ) are closed.
-
- The persistent data appear as if the transaction never began.
- The execution state of the application program is not affected by
- calling sm_AbortTransaction( ). The result is that the transient
- data in the program's address space do not match the state of the
- persistent data. The problem can be alleviated to some degree by
- judicious use of setjmp(2), longjmp(2), and lexical scoping in
- the application program. The following macros, which are defined
- in sm_client.h, do that:
-
-
- SM_BEGIN_TRANSACTION (tid, abortCode)
- TID *tid; /* transaction ID */
- int abortCode; /* location to store abort code */
-
- SM_BEGIN_TRANSACTION begins a transaction block (i.e. it opens a
- new lexical scope in C or C++). The transaction ID is placed in
- "tid". The argument "abortCode" must be a variable. This vari-
- able can be checked at the end of the transaction to determined
- if it was aborted.
-
-
- SM_COMMIT_TRANSACTION (tid)
- TID tid; /* transaction ID */
-
- SM_COMMIT_TRANSACTION ends a transaction block. When this state-
- ment is executed, the transaction is committed, assuming no error
- occurs during commit. Immediately after the SM_COMMIT_TRANSACTION
- statement, the "abortCode" variable given in the
- SM_BEGIN_TRANSACTION statement should be checked to see if any
- error occurred. If no error occurred, "abortCode" is set to
- esmNOERROR. Otherwise, "abortCode" is set to the value given in
- SM_ABORT_TRANSACTION.
-
-
- SM_ABORT_TRANSACTION (abortCode)
- int abortCode; /* error to return on abort */
-
- SM_ABORT_TRANSACTION aborts the active transaction (i.e.
- sm_AbortTransaction( ) is called) and resumes execution at the
- line immediately following the SM_COMMIT_TRANSACTION statement
- for the transaction. The SM_ABORT_TRANSACTION macro does not need
- to be called within the lexical scope of the transaction block.
- It can be called in any function operating in the dynamic scope
- of the transaction. The "abortCode" argument sets the
- "abortCode" variable given in SM_BEGIN_TRANSACTION.
-
-
-
-
- 23
-
-
-
-
-
-
-
-
-
-
- When a SM_ABORT_TRANSACTION is called, the program's control is
- transferred to the program point after the SM_COMMIT_TRANSACTION
- statement. The stack pointer is restored to the level of the
- transaction block, so functions on the program's stack after it
- are not completed. For C++, this means that destructors are not
- called for any local variables in those functions.
-
- Examples of using both the transaction macros and functions can
- be found in the producer-consumer example given in the Storage
- Manager software release.
-
- 4.4. Mounting and Dismounting Volumes
-
- An application program does not need to mount and dismount
- volumes explicitly. In most cases, the client library automati-
- cally mounts a volume when the application makes its first refer-
- ence to that volume. An application that does not explicitly
- mount a volume may, when it performs its first operation on an
- object, find that the server for that object is not running.
- Writing programs to handle such common errors can be difficult,
- so it may be more convenient to mount volumes before proceeding
- with operations on data. Sm_MountVolume( ) serves that purpose.
- If that server has not yet been contacted, sm_MountVolume( )
- establishes a connection to the server and mounts the volume. It
- does not begin a transaction. (See Section 4.3.3, Transaction
- Operations to understand how transactions are begun.)
-
- When an application exits or calls sm_ShutDown( ), connections to
- servers are severed, and the servers dismount the volumes used by
- the application. A server severs its connections and dismounts
- the volumes if an application is inactive for a significant time.
- An application is inactive if it has no transaction running.
-
- An application can dismount volumes explicitly, causing the
- volumes to be dismounted at the server. An application that con-
- tinues to run after it is finished using the Storage Manager
- would do well to use sm_ShutDown( ). If it is inappropriate to
- use sm_ShutDown( ), but such an application is finished with a
- set of volumes, it would do best to dismount the volumes, partic-
- ularly if the volumes are likely to be reformatted.
-
-
- sm_MountVolume ( volid )
- VOLID volid; /* IN volume to mount */
-
-
- Sm_MountVolume( ) causes the volume identified by "volid" to be
- mounted. A side effect of the operation is that the client
- library has established a connection with the server that manages
- this volume.
-
-
-
-
-
- 24
-
-
-
-
-
-
-
-
-
-
- If the volume cannot be mounted, sm_MountVolume( ) returns
- esmFAILURE and a value in sm_errno that describes the reason:
- esmNOSUCHVOLUME (the client library cannot identify the server
- for this volume because there is no "mount" option for this
- volid), esmTRANSABORTED (the transaction was aborted during the
- previous operation, and the next thing the application must do is
- abort the transaction), esmSERVERDIED (connection with server was
- severed during the mount operation), or any Unix error message
- from <errno.h> (such as ENETDOWN and ECONNREFUSED), which indi-
- cate that the server is not running or is unreachable through the
- network.
-
-
- sm_DismountVolume ( volid )
- VOLID volid; /* IN volume to dismount */
-
-
- The "volid" argument identifies the volume to be dismounted. If
- the volume is not mounted, the operation returns esmFAILURE, and
- the client library returns esmBADVOLID in sm_errno.
-
- 4.5. Root Entries
-
- The root entry facility is designed for applications to get a
- handle to data on a volume. [1] A common use of a root entry is
- to associate a string name with an object identifier for an
- object containing information about the contents of the volume.
- For example, in a database system, this might be the object iden-
- tifier for the catalog.
-
- A root entry is a string and data pair stored in a special loca-
- tion on a volume, called the root area. The string, called the
- name, is used to identify the entry. The name string must be
- null-terminated. The maximum lengths of the name (including the
- terminating null) and data are defined by MAX_ROOTNAME_SIZE and
- MAX_ROOTDATA_SIZE respectively. An error is returned if the
- available number of root entries is exceeded. Names and data are
- limited to 32 bytes each, and approximately 90 root entries can
- reside in a volume's root area.
-
-
- sm_SetRootEntry (volid, name, data, dataLength)
- VOLID volid; /* IN volume identifier */
- char *name; /* IN name to store data entry under */
- void *data; /* IN data entry to be stored */
- int dataLength; /* IN length of the data */
-
-
- ____________________
- [1] Root entries cannot be created on temporary volumes.
-
-
-
-
-
- 25
-
-
-
-
-
-
-
-
-
-
- Sm_SetRootEntry( ) is creates or updates an entry. The "name"
- argument is the name of the entry and the "data" argument is the
- data to be stored. The number of bytes in the data is given in
- "dataLength". For example, to store the contents of the variable
- "rootOid" under the name "root-obj", use sm_SetRootEntry(volid,
- "root-obj", (char*) &rootOid, sizeof(rootOid)).
-
- Sm_SetRootEntry( ) obtains an exclusive lock on the root area of
- the volume, so updates to root entries should be performed in a
- short transaction.
-
-
- sm_GetRootEntry (volid, name, data, dataLength)
- VOLID volid; /* IN volume identifier */
- char *name; /* IN name of the entry */
- void *data; /* OUT data stored under name */
- int *dataLength; /* IN/OUT length of the data */
-
- Sm_GetRootEntry( ) retrieves the root entry named "name". The
- data is placed in "data" and the length of the data is returned
- in "dataLength". If "dataLength" is initialized with a value
- greater than or equal to zero, the maximum number of bytes copied
- to "data" is "dataLength". If "dataLength" is initialized with a
- value less than zero, the entire length of the data is copied to
- "data".
-
- Sm_GetRootEntry( ) obtains a share lock on the root area of the
- volume. This share lock blocks other transactions from updating
- or removing root entries until the transaction is committed or
- aborted. If no root entry exists for "name", esmFAILURE is
- returned and sm_errno is set to esmBADROOTNAME.
-
-
- sm_RemoveRootEntry (volid, name)
- VOLID volid; /* IN volume identifier */
- char *name; /* IN name of entry */
-
-
- Sm_RemoveRootEntry( ) removes the root entry stored under "name".
- Sm_RemoveRootEntry( ) obtains an exclusive lock on the root area
- of the volume, so removal of root entries should be performed in
- a short transaction.
-
- 4.6. Buffer Operations
-
- The Storage Manager buffer manager implements the concept of a
- buffer group, as proposed in the DBMIN buffer management algo-
- rithm [Chou85]. The essence of the DBMIN algorithm is that com-
- peting uses of the buffer pool may be allocated their own
- buffers, to minimize competition for the buffers and to eliminate
- thrashing in the buffer pool.
-
-
-
-
- 26
-
-
-
-
-
-
-
-
-
-
- All uses of the buffer pool are made through a buffer group. A
- buffer group is a container of page buffers, with a limit on the
- number of fixed pages it can contain. Fixed pages are guaranteed
- to remain in the buffer pool until they are unfixed. Their loca-
- tions (virtual addresses) may change, but the pages remain in the
- virtual address space of the buffer pool. Each buffer group has
- a replacement policy, which controls the replacement of unfixed
- pages within the buffer group.
-
- Buffer groups can be opened and closed at any time, whether or
- not a transaction is running. If a buffer group is opened in a
- transaction, it may be "attached" to the transaction, which means
- that the buffer group is closed by the client library when the
- transaction ends. An attached buffer group can be closed expli-
- citly by the application before the transaction ends.
-
- The following two macros can be used with buffer groups to give
- an initial value to a buffer group index and to recognize that
- value.
-
- INVALIDATE_BUFGROUP (int bufgroup)
-
-
- sets the "bufgroup" argument to an invalid buffer group index.
-
- BUFGROUP_IS_INVALID (int bufgroup)
-
-
- returns TRUE if "bufgroup" is the value given by
- INVALIDATE_BUFGROUP( ), FALSE if it is not.
- BUFGROUP_IS_INVALID( ) does not tell if there exists a buffer
- group with the given index.
-
-
- sm_OpenBufferGroup (groupSize, policy, groupIndex, flags)
- int groupSize; /* IN the maximum group size in pages */
- int policy; /* IN the group's replacement policy */
- int *groupIndex; /* OUT the group's index */
- FLAGS flags; /* IN buffer group attributes */
-
-
- Sm_OpenBufferGroup( ) opens a new buffer group. The "groupSize"
- argument specifies the size of the buffer group in MIN_PAGESIZE
- pages. The sum of the sizes of all open buffer groups cannot
- exceed the size of the buffer pool. (See Section 4.11.3, Tuning
- the Application.) The choice for "policy" is least-recently-used
- (BF_LRU) or most-recently-used (BF_MRU). BF_LRU and BF_MRU are
- defined in sm_client.h. The argument "groupIndex" is filled by
- the Storage Manager and must be used in subsequent references to
- the buffer group. (All operations on files and objects require a
- buffer group index.)
-
-
-
-
- 27
-
-
-
-
-
-
-
-
-
-
- The "flags" indicates whether the buffer group is to be associ-
- ated with a transaction. NOFLAGS indicates that it is not.
- TRANS_GROUP indicates that the buffer group is associated with
- the current transaction. The group is closed by the client
- library when the active transaction ends. If TRANS_GROUP is used,
- a transaction must be running at the time sm_OpenBufferGroup( )
- is called.
-
- The effect of sm_OpenBufferGroup( ) is to reserve "groupSize"
- pages in the client's buffer pool. No buffer group is opened on
- the server.
-
-
- sm_BufferGroupInfo (groupIndex, maxPages, fixedPages, unfixedPages)
- int groupIndex; /* IN the group to inspect */
- int *maxPages; /* OUT max fixed pages allowed */
- int *fixedPages; /* OUT current # of pages fixed */
- int *unfixedPages; /* OUT current # of pages unfixed */
-
-
- Sm_BufferGroupInfo( ) returns information about the open buffer
- group identified by "groupIndex". The function returns the
- buffer group's size limit in pages in "maxPages". In "fix-
- edPages", it returns the number of pages currently fixed in the
- buffer group. See the next section for more information about
- these functions. The argument "unfixedPages" refers to all
- buffer pages that belong to the buffer group, but are not fixed,
- that is these pages may be removed from the buffer pool if space
- is needed for fixed pages.
-
-
- sm_CloseBufferGroup (groupIndex)
- int groupIndex; /* IN the group being closed */
-
- Sm_CloseBufferGroup( ) closes the open buffer group identified by
- "groupIndex".
-
-
- 4.7. Operations on Objects
-
- An object in the Storage Manager is a container of bytes. It can
- be empty. It can have as many as 2[31] bytes, if the volume on
- which it resides is large enough. An object must fit on a single
- volume (storage device or partition). When an object is created,
- the Storage Manager gives the object a unique object identifier.
- An object identifier is described by a structure of the type OID,
- defined as follows:.
-
-
-
-
-
-
-
-
- 28
-
-
-
-
-
-
-
-
-
-
- typedef struct {
- SHORTPID pid; /* 32-bit page address of the object's header */
- SLOTINDEX slot; /* 16-bit slot number of the object on the page */
- VOLID volid; /* 16-bit identifier of the volume */
- UNIQUE unique; /* 32-bit number generated at creation time */
- } OID;
-
-
- The first three fields of an OID are the physical address of the
- object; they identify a volume, a page within the volume, and a
- slot on the page. An object's identifier never changes. The
- client library sometimes moves objects, such as when an object
- grows beyond the size of a page, at which time the object is
- marked as forwarded, but its OID remains unchanged.
-
- The "unique" field of an OID is special 32-bit value that is gen-
- erated when the object is created and used to detect dangling and
- corrupted OIDs. The generation of unique numbers is discussed in
- Appendix B.
-
- Every time an object is accessed by its OID, the Storage Manager
- validates the OID. The application can use the following macros
- to give an illegitimate initial value to an OID, and to recognize
- that value:
-
- INVALIDATE_OID (OID oid)
-
- sets the "oid" argument to an invalid object identifier.
-
- OID_IS_INVALID (OID oid)
-
- returns TRUE if "oid" is the value given by INVALIDATE_OID( ),
- FALSE if it is not.
-
- Each object has an object header, which describes the object, and
- which can be retrieved without retrieving the object's data. The
- structure of an object header is shown below:
-
- typedef struct {
- TWO properties; /* a bit vector */
- TWO tag; /* supplied by the application */
- int size; /* size of the object in bytes */
- } OBJHDR;
-
-
- The "tag" is a two-byte field that the Storage Manager does not
- interpret. It is for use by the application. No restriction is
- put on the contents of "tag" fields. As its name implies, the
- "size" field is the size of the object in bytes. The "proper-
- ties" field is a read-only bit-vector that indicates the presence
- or absence of the following properties of objects:
-
-
-
-
- 29
-
-
-
-
-
-
-
-
-
-
-
- P_LARGEOBJ set if the object is a large object.
-
- P_MOVED set if this object has been forwarded
- to another page.
-
- P_FROZEN set if the object is a frozen version.
-
- P_VERSIONED set if the object is a frozen version
- or a descendent of a frozen version.
-
-
- Each object resides in a file on a volume. When an object is
- created, the application tells the client library in which file
- to place the object. Files and their uses are discussed in the
- next section; details of their use are not pertinent to under-
- standing the operations on objects.
-
- Before an operation can be performed on an existing object, the
- object, or at least the affected parts of the object, must be
- brought into the application's address space. This is called
- pinning the object or its parts. When the object is no longer
- needed, it must be unpinned, to make room for other objects to be
- pinned[2]. When the client library pins and object in order to
- perform an operation on behalf of the application (for example,
- appending bytes to an object), the client library pins the neces-
- sary parts of the object and unpins them before it returns con-
- trol to the application. When the application pins part of an
- object for its own purposes (such as writing over bytes in the
- object), the pinned part is placed in the client's buffer pool,
- and the client library creates a "handle" for the the object. The
- handle is called a user descriptor. The application can refer to
- an object only through user descriptors. The application must
- unpin the object by releasing the user descriptor when it is done
- using the object.
-
- A user descriptor is called valid if and only if the byte range
- it addresses is pinned. An application can pin an object or
- overlapping parts of an object any number of times, having any
- number of valid user descriptors for the same data in an object.
- (This is not wise for performance reasons, but it can be done.)
-
- The client library functions that pin ranges of bytes return user
- descriptors to describe the bytes pinned. Functions that require
- that the range of bytes they affect be pinned take user descrip-
- tors as input arguments. The client library functions that do
- not take user descriptor arguments do not ultimately change the
- ____________________
- [2] Objects are pinned; pages are fixed. The gist of the two
- verbs is the same.
-
-
-
-
-
- 30
-
-
-
-
-
-
-
-
-
-
- quantity of bytes pinned or the number of pages fixed in the
- buffer pool. Such functions may change the ranges of bytes
- addressed or the bytes themselves, but they do not change the
- quantity of bytes addressed. (For example, the function
- sm_InsertInObject( ) may affect valid user descriptors even
- though it does not take and user descriptors as arguments.)
-
- User descriptors have the following form:
-
- typedef struct {
- char *basePtr; /* ptr to start of data */
- int byteCount /* number of bytes accessible */
- int objectSize; /* total size of object */
- TWO userFlags; /* properties field from object header */
- TWO type; /* for use only by E */
- TWO flags; /* for use only by E */
- TWO tag; /* tag field from the object header */
- OID oid; /* oid of object being referenced */
- } USERDESC;
-
-
- The "basePtr" field of a user descriptor points to the start of
- the object's data in the buffer pool, while the "byteCount" field
- indicates the number of bytes accessible to the application pro-
- gram through this user descriptor. The value "objectSize" is the
- length of the entire object. The "userFlags" field holds a copy
- of the properties field from the object's header. The "type" and
- "flags" fields are used by the E language's persistent virtual
- machine. Finally, the "tag" field contains a copy of the "tag"
- field in the object's header.
-
- An object's data is referenced indirectly via the "basePtr"
- field. References by the application must always be indirect via
- "basePtr". The indirection is necessary because there are times
- when the Storage Manager moves an object in the buffer pool, and
- the "basePtr" of each user descriptor that references the object
- is updated to account for the move.
-
- The remainder of this section describes the Storage Manager func-
- tions for operating on objects. It is divided into sub-sections
- that describe creating and destroying objects, pinning and unpin-
- ning parts of objects, modifying objects, and using object
- headers.
-
- 4.7.1. Creating and Destroying Objects
-
-
-
-
-
-
-
-
-
-
- 31
-
-
-
-
-
-
-
-
-
-
- sm_CreateObject (groupIndex, fid, nearHint, nearObj, objHdr, length, data, oid)
- int groupIndex; /* IN buffer group to use */
- FID *fid; /* IN file in which object is to be placed */
- int nearHint; /* IN flag indicating where to create the new object */
- OID *nearObj; /* IN create the new object near this object */
- OBJHDR *objHdr; /* IN the object's header */
- int length; /* IN amount of data */
- void *data; /* IN the initial data for the object */
- OID *oid; /* OUT the new object's OID */
-
-
- Sm_CreateObject( ) creates an object in the file identified by
- "fid". If "objHdr" is not NULL, the "tag" field in the header of
- the new object is initialized with the contents of the "tag"
- field in the header structure addressed by "objHdr". When "data"
- is not NULL, the object is initialized with the data addressed by
- the argument "data"; in this case, "length" specifies how much
- data to copy. When "data" is NULL, an object of size "length" is
- created and filled with zeroes.
-
- The argument "nearHint" specifies where the new object should be
- created. The following values, defined in sm_client.h, are near
- hints: NEAR_OBJ, NEAR_FIRST, and NEAR_LAST. If "nearHint" is set
- to NEAR_OBJ, the new object is created near the object designated
- by "nearObj". If "nearHint" is set to NEAR_FIRST or NEAR_LAST,
- "nearObj" is ignored and the new object is created near the first
- or last object in the file, respectively.
-
- If sm_CreateObject( ) is successful, the OID structure pointed to
- by "oid" is filled with the OID of the new object.
- Sm_CreateObject( ) does not leave the new object pinned.
-
- sm_DestroyObject (groupIndex, oid)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN the object to destroy */
-
-
- Sm_DestroyObject( ) destroys an object. If any user descriptors
- are valid for the object when the object is destroyed, they are
- made invalid, and they must be released with sm_ReleaseObject( ),
- described below.
-
- 4.7.2. Pinning and Unpinning Objects
-
-
- The following two functions change the number of pages fixed in
- the client buffer pool. All the other functions that operate on
- objects fix pages temporarily and unfix the pages before return-
- ing.
-
-
-
-
-
-
- 32
-
-
-
-
-
-
-
-
-
-
- sm_ReadObject (groupIndex, oid, start, length, userDesc)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object to read */
- int start; /* IN starting offset of read */
- int length; /* IN amount of data to read */
- USERDESC **userDesc; /* OUT descriptor to access the data */
-
-
- Sm_ReadObject( ) reads part or all of the object identified by
- "oid" into the buffer group identified by "groupIndex". If
- "length" is READ_ALL, the entire object is read (assuming that
- the size of the entire object is not greater than the amount of
- unpinned space in the buffer group). Otherwise, the bytes to be
- read are specified by "start" and "length".
-
- Sm_ReadObject( ) pins the specified range of bytes in the buffer
- pool and returns a user descriptor to the caller. Bytes pinned
- in the buffer pool by sm_ReadObject( ) remain pinned until they
- are explicitly released by sm_ReleaseObject( ).
-
- While sm_ReadObject( ) can be used to get information about the
- object (from the object header) by giving it a length of zero,
- sm_ReadObjectHeader( ) is the preferred way to meet the same
- objective. Sm_ReadObject( ) performs work that is unnecessary
- when only the object header is of interest, and it always fixes
- at least one page in the buffer pool, even if the given length is
- zero.
-
- The user descriptor consumes resources that must be freed with
- sm_ReleaseObject( ), even if the object is not pinned (zero is
- given for "length").
-
-
- sm_ReleaseObject (userDesc)
- USERDESC *userDesc; /* IN descriptor returned by ReadObject */
-
-
- Sm_ReleaseObject( ) unpins a range of bytes of an object that was
- pinned by sm_ReadObject( ), and frees the resources associated
- with the user descriptor. If the user descriptor is not valid,
- sm_ReleaseObject( ) sets sm_errno to esmBADUSERDESC and returns
- esmFAILURE.
-
- 4.7.3. Modifying Objects
-
- Four functions modify objects: sm_WriteObject( ),
- sm_InsertInObject( ), sm_AppendToObject( ), and
- sm_DeleteFromObject( ). Sm_WriteObject( ) cannot be used to
- change the size of an object, only to overwrite parts of an
- object. The other three functions can change the size of an
- object. These functions provide substantial flexibility, and
- their efficiency varies. Changing the size of a small object
-
-
-
- 33
-
-
-
-
-
-
-
-
-
-
- (one that fits on a disk page) is relatively inexpensive. It is
- less expensive than reading and writing the object. For large
- objects, performing many small-size changes can be expensive in
- CPU time and buffer space utilization. If a large object is
- pinned several times simultaneously, through different user
- descriptors, updates to the object are very expensive. If a
- large number of small-size changes is required, we recommend
- accumulating the changes and performing them in larger chunks.
-
-
- sm_WriteObject (groupIndex, start, length, data, userDesc, release)
- int groupIndex; /* IN buffer group in use */
- int start; /* IN starting offset of write */
- int length; /* IN amount of data to be written */
- void *data; /* IN pointer to the data */
- USERDESC *userDesc; /* IN descriptor returned by ReadObject */
- BOOL release; /* IN whether to release the object */
-
-
- Sm_WriteObject( ) overwrites the region of bytes from (userDesc-
- >baseptr + start) to (userDesc->baseptr + start + length - 1)
- with the data addressed by the "data" argument. The given byte
- range must have been pinned (which means that the user descriptor
- must be valid). If "release" is TRUE, the range of bytes given
- by "userDesc" is unpinned when sm_WriteObject( ) returns. If
- "data" is NULL, the region is filled with zeroes. All updates to
- objects must be performed using sm_WriteObject( ) so that the
- updates can be logged, and the transaction semantics can be
- guaranteed.
-
-
- sm_InsertInObject (groupIndex, oid, start, length, data)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object we're inserting into */
- int start; /* IN starting offset of insert */
- int length; /* IN amount of data being inserted */
- void *data; /* IN data to insert */
-
-
- Sm_InsertInObject( ) inserts "length" bytes of data into an
- object, beginning at the offset "start". If "data" is NULL, the
- inserted region is filled with zeroes. If there are any valid
- user descriptors (those for which sm_ReleaseObject( ) has not
- been called) for the object at the time the insertion takes
- place, they are reestablished if necessary. After the insertion,
- the base pointers of the valid user descriptors point to the byte
- within the object indicated by the "start" argument to the
- sm_ReadObject( ) operation that created the user descriptor. For
- example, an object's first five bytes, "ABCDE" are pinned by
- sm_ReadObject( ), which was called with a "start" offset of zero
- and a "length" of five. Sm_ReadObject( ) returns a user descrip-
- tor, U, which addresses "ABCDE". Sm_InsertInObject( ) inserts
-
-
-
- 34
-
-
-
-
-
-
-
-
-
-
- "ZZ" at "start" offset zero. The user descriptor U now addresses
- "ZZABC", which are pinned, while the bytes "DE" are no longer
- pinned.
-
-
- sm_AppendToObject (groupIndex, oid, length, data)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object we are appending data to */
- int length; /* IN amount of data being appended */
- void *data; /* IN data to append */
-
-
- Sm_AppendToObject( ) appends "length" bytes of data to the end of
- an object. Outstanding user descriptors are handled the same way
- as sm_InsertInObject( ). If "data" is NULL, the appended region
- is filled with zeroes.
-
-
- sm_DeleteFromObject (groupIndex, oid, start, length)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN object we're inserting into */
- int start; /* IN starting offset of delete */
- int length; /* IN amount of data being deleted */
-
-
- Sm_DeleteFromObject( ) deletes "length" bytes of data from an
- object, beginning with the byte indicated by the offset "start".
- Sm_DeleteFromObject( ) is analogous to sm_InsertObject( ). All
- valid user descriptors affected by the deletion are, if possible,
- reset to point to the new absolute offset within the object.
- There are two cases when this is not possible.
-
- (1) The object's size is now smaller than the starting offset
- of a user descriptor. The "basePtr" field in the user
- descriptor is set to NULL and the user descriptor is made
- invalid. The user descriptor must be released by
- sm_ReleaseObject( ) so that its resources can be
- reclaimed.
-
- (2) The object's size is now smaller than the original byte
- range addressable by a user descriptor. The size of the
- range addressable by the descriptor is reduced to reflect
- the new size of the object.
-
- 4.7.4. Object Headers
-
-
- sm_ReadObjectHeader (groupIndex, oid, objHdr)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN read this object's header */
- OBJHDR *objHdr; /* OUT place to put the header */
-
-
-
-
- 35
-
-
-
-
-
-
-
-
-
-
- Sm_ReadObjectHeader( ) reads an object's header into the struc-
- ture addressed by "objHdr". This function is the preferred one
- to use to determine if an object's identifier is valid. If the
- object's identifier is invalid, Sm_ReadObjectHeader( ) returns
- esmFAILURE and puts esmBADOID in sm_errno.
-
-
- sm_SetObjectHeader (groupIndex, oid, objHdr)
- int groupIndex; /* IN buffer group in use */
- OID *oid; /* IN set this object's header flags */
- OBJHDR *objHdr; /* IN the new header */
-
-
- Sm_SetObjectHeader( ) modifies an object's header. Only the
- "tags" field is modified; the other fields are read-only.
-
- 4.8. Versions of Objects
-
- In order to allow efficient updating of shared data, the Storage
- Manager offers versions of objects. Versions come in two kinds:
- working versions and frozen versions. A working version of an
- object is one that can be modified. Every object has at least
- one version, which is the object itself. A working version may
- be frozen, after which it can no longer be modified.
-
- A new working version, called a descendent, can be made of a
- frozen object. The descendent looks like a new object that is a
- copy of the frozen object from which it came. The Storage
- Manager determines when it is necessary and efficient to make a
- copy of the frozen object, and makes the copy at that time.
-
-
- sm_CreateVersion (groupIndex, nearHint, parentObj, nearObj, oid)
- int groupIndex; /* IN buffer group to use */
- int nearHint; /* IN flag indicating where to create the new version near */
- OID *parentObj; /* IN object to create a version of */
- OID *nearObj; /* IN create the new version near this object */
- OID *oid; /* OUT the new version's OID */
-
-
- Sm_CreateVersion( ) creates a new version of the object "paren-
- tObj" in the file containing "parentObj". The arguments "groupIn-
- dex", "nearHint", and "nearObj" are used as in
- sm_CreateObject( ). The object identifier of the new version is
- returned in "oid". The object identified by "parentObj" must be a
- frozen version. The new version is a working version. The new
- version can be destroyed using sm_DestroyObject( ). When a new
- version is created, the P_VERSIONED property is set in the object
- header.
-
- Like sm_CreateObject( ), sm_CreateVersion( ) does not leave any-
- thing pinned in the buffer pool.
-
-
-
- 36
-
-
-
-
-
-
-
-
-
-
- sm_FreezeVersion (groupIndex, oid)
- int groupIndex; /* IN buffer group to use */
- OID *oid; /* IN object to be frozen */
-
-
- Sm_FreezeVersion( ) marks an object as frozen, preventing subse-
- quent modification of the object, and allowing new working ver-
- sions to be made from this object. When an object is frozen,
- both the P_VERSIONED and the P_FROZEN properties are set in the
- object header. Once frozen, an object cannot be unfrozen. A
- frozen object can be destroyed.
-
- 4.9. Operations on Files
-
- A Storage Manager file is a flexible container in which objects
- are place when they are created. No object exists outside a
- file.
-
- The objects in a file can be scanned, meaning that they are
- visited exactly once.
-
- Files do not have preallocated space or ownership properties.
- Various consistency guarantees can be associated with files, with
- the effect that updating data in different files has different
- costs.
-
- The Storage Manager offers operations for creating, destroying,
- scanning, bulk-loading files, and for changing the consistency
- guarantees associated with files. Some operations on files
- acquire locks on entire files. The locks acquired are described
- in Appendix A.
-
- A file is identified by a unique file identifier or FID. The
- Storage Manager does not provide a way to find all files or file
- identifiers that exist, so it is left to the application to keep
- track of its file identifiers. For example, consider an appli-
- cation that embeds file identifiers in objects to create a logi-
- cal hierarchy of files. The application had best destroy the
- files in a depth-first fashion, lest it lose a file identifier
- before the file it identifies is destroyed.
-
- The following two macros can be used to give a file identifier an
- illegitimate initial value, and later to recognize that value:
-
- INVALIDATE_FID (FID fid)
-
- sets "fid" to an invalid file identifier.
-
- FID_IS_INVALID (FID fid)
-
- returns TRUE if "fid" is the invalid identifier given by
- INVALIDATE_FID( ), FALSE otherwise.
-
-
-
- 37
-
-
-
-
-
-
-
-
-
-
- The rest of this section describes operations on files and opera-
- tions that concern entire files of objects.
-
- 4.9.1. Consistency Guarantees for Files
-
- The log level of a file determines what level of consistency is
- maintained for the file in the event that a transaction aborts or
- a server crashes. There are two log levels for files on data
- volumes: LOG_ALL and LOG_SPACE. LOG_ALL indicates that con-
- sistency is maintained for user data and meta-data. LOG_SPACE
- indicates that meta-data are guaranteed to be consistent. This
- means that all objects are available and that they are the
- correct size, but their contents may be corrupted. Files that
- have their log level set to LOG_SPACE are flushed when the tran-
- saction is committed. Data pages for large objects (objects that
- do not fit on a single disk page) may not be flushed, so there is
- no guarantee that the data is safely on disk until the server
- dismounts the volume. The log level is not a permanent attribute
- of a file. When an application sets the log level for a file, the
- setting lasts until it is changed or until sm_ShutDown( ) is
- called. If, in a transaction, the log level for a file is
- changed from LOG_SPACE to LOG_ALL, the Storage Manager guarantees
- only that the meta-data are consistent.
-
- LOG_ALL is the default log level for data files. LOG_SPACE is
- designed to conserve log space and increase performance for those
- files whose data integrity is not critical. For example, results
- of a query may be stored in a file with its log level set to
- LOG_SPACE, since file can be regenerated, in the event of a
- failure. To conserve log space when loading a large file, the
- log level for a file may be set to LOG_SPACE. Once the loading
- transaction is committed, the log level should be set to LOG_ALL.
-
- Files on temporary volumes can have only one log level: LOG_NONE.
- See Section 5.1.3, Temporary Volumes, for more information about
- temporary volumes.
-
- Sm_SetLogLevel( ) is used to change the log level for a list of
- files:
-
-
- sm_SetLogLevel (logLevel, fileCount, fids)
- int logLevel; /* IN log level */
- int fileCount; /* IN number of files to set level for */
- FID fid[]; /* IN list of files to set level for */
-
-
- The "logLevel" argument takes the values LOG_SPACE and LOG_ALL.
- The "fileCount" argument indicates the size of the last argument,
- "fid[]", which is a list of file identifiers of the files whose
- log levels are to be affected by this function. It is not an
- error for a file in the list already to have the given log level.
-
-
-
- 38
-
-
-
-
-
-
-
-
-
-
- If "fileCount" is zero, all files are given "logLevel".
-
- The volumes on which the files reside must be available for
- mounting, and a side effect of setting the log level is that the
- volumes are mounted.
-
- Sm_SetLogLevel( ) has no effect on files that reside on temporary
- volumes (see Section 5.1.3, Temporary Volumes).
-
- sm_CreateFile (groupIndex, volid, fid)
- int groupIndex; /* IN buffer group in use */
- VOLID volid; /* IN the volume in which to place the file */
- FID *fid; /* OUT the file ID of the new file */
-
-
- Sm_CreateFile( ) creates a new file on the volume indicated by
- "volid". The file identifier of the new file is returned in the
- structure to which "fid" points. The caller is responsible for
- allocating space for the FID.
-
-
- sm_DestroyFile (groupIndex, fid)
- int groupIndex; /* IN buffer group in use */
- FID *fid; /* IN the file to destroy */
-
-
- Sm_DestroyFile( ) destroys the file identified by "fid". The
- objects in the file are destroyed along with the file. Disk
- space is released when the transaction is committed.
-
-
- sm_GetFirstOid (groupIndex, fid, oid, objHdr, emptyFlag)
- int groupIndex; /* IN buffer group in use */
- FID *fid; /* IN the file */
- OID *oid; /* OUT first OID */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *emptyFlag; /* OUT empty file flag */
-
-
- Sm_GetFirstOid( ) retrieves the object identifier and the object
- header of the first object in the file designated by "fid". The
- first object is the first object on the first physical page in
- the file. If the file does not contain any objects, "emptyFlag"
- is set to TRUE.
-
-
- sm_GetLastOid (groupIndex, fid, oid, objHdr, emptyFlag)
- int groupIndex; /* IN buffer group in use */
- FID *fid; /* IN the file */
- OID *oid; /* OUT last OID */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *emptyFlag; /* OUT empty file flag */
-
-
-
- 39
-
-
-
-
-
-
-
-
-
-
- Sm_GetLastOid( ) retrieves the object identifier and the object
- header of the last object in the file designated by "fid". The
- last object is the last object on the last physical page in the
- file. If the file does not contain any objects, "emptyFlag" is
- set to TRUE.
-
-
- sm_GetNextOid (groupIndex, baseOid, nextOid, objHdr, endMarker)
- int groupIndex; /* IN buffer group in use */
- OID *baseOid; /* IN next relative to this object */
- OID *nextOid; /* OUT OID of the next object */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *endMarker; /* OUT end-of-file flag */
-
-
- Sm_GetNextOid( ) retrieves the object identifier and the object
- header of the next object in the file relative to the object
- addressed by "baseOid". "EndMarker" is set to TRUE when end-of-
- file is reached (i.e., when there is no next object for
- sm_GetNextOid( ) to return).
-
- The next object is that which resides physically next in the
- file. There is no way to scan a file's objects in the order in
- which they were inserted in the file.
-
- The preferred method for retrieving all the objects in a file is
- to use scans, described in the next sub-section. Scans are more
- efficient than using sm_GetNextOid( ), which is present for back-
- ward compatibility.
-
-
- sm_GetPreviousOid (groupIndex, baseOid, prevOid, objHdr, endMarker)
- int groupIndex; /* IN buffer group in use */
- OID *baseOid; /* IN previous relative to this object */
- OID *prevOid; /* OID of the previous object */
- OBJHDR *objHdr; /* OUT the object's header */
- BOOL *endMarker; /* OUT start-of-file flag */
-
-
- Sm_GetPreviousOid( ) retrieves the object identifier and object
- header of the previous object in the file relative to the object
- addressed by "baseOid". "EndMarker" is set to TRUE when start-
- of-file is reached (i.e., when there is no next object for
- sm_GetPreviousOid( ) to return). Much like sm_GetNextOid( ), the
- previous object is the object that is physically previous in the
- file.
-
- 4.9.2. Scanning Files
-
- The objects in a file can be visited most efficiently by scanning
- the file. During a scan, the client library locks the entire file
- so that while one application is using the file, objects cannot
-
-
-
- 40
-
-
-
-
-
-
-
-
-
-
- be inserted, deleted, or changed by another application. The
- Storage Manager does not support a single application's modifying
- a file during a scan.
-
- The client library also some information about the state of the
- scan and the structure of the file being scanned. The informa-
- tion is stored in a scan descriptor, a structure of type SCAN-
- DESC, which is meant to be treated as opaque by the application.
-
-
- sm_OpenScanWithGroup (fid, type, groupIndex, scanDesc, oid)
- FID *fid; /* IN file to scan */
- int type; /* IN type of scan -- UNUSED */
- int groupIndex; /* IN buffer group for use in scan */
- SCANDESC **scanDesc; /* OUT returned scan descriptor */
- OID *oid; /* IN optional oid to begin scan -- UNUSED */
-
-
- Sm_OpenScanWithGroup( ) initializes a scan on the file indicated
- by "fid". A scan descriptor is passed back in "scanDesc", for
- use in subsequent scan calls. Using the scan mechanism can be
- considerably more efficient that using the sm_GetNextOid( ) call
- or sm_ReadObject( ). Scans use a buffer group, "groupIndex".
- This group should be created with the most-recently-used replace-
- ment policy, and its size should be tuned to reflect the buffer-
- ing requirements for the scan. The buffer group should have a
- size of at least five pages.
-
- Objects are scanned in the order in which they physically reside
- on disk. After sm_OpenScanWithGroup( ) returns, the scan pointer
- is before the first object in the file. This is true even if the
- file is empty, in which case the first call to
- sm_ScanNextObject( ) returns a flag indicating the end-of-file
- condition. The "type" and "oid" arguments are not used and are
- present for backward compatibility.
-
-
- sm_OpenScan (fid, type, groupSize, scanDesc, oid)
- FID *fid; /* IN file to scan */
- int type; /* IN type of scan -- UNUSED */
- int groupSize; /* IN size of buffer group in pages */
- SCANDESC **scanDesc; /* OUT returned scan descriptor */
- OID *oid; /* IN optional oid to begin scan -- UNUSED */
-
-
- Sm_OpenScan( ) is like sm_OpenScanWithGroup( ), but it is less
- flexible, and it is provided for backward compatibility. It is
- identical to sm_OpenScanWithGroup( ) except that it creates a
- buffer group with the most-recently-used replacement policy and
- size "groupSize". "GroupSize" should be at least five (pages).
- The buffer group is destroyed when the scan is closed.
-
-
-
-
- 41
-
-
-
-
-
-
-
-
-
-
- sm_ScanNextObject (scanDesc, start, length, retDesc, eof)
- SCANDESC *scanDesc; /* IN scan descriptor */
- int start; /* IN starting offset in object */
- int length; /* IN number of bytes to read */
- USERDESC **retDesc; /* OUT descriptor to access the data */
- BOOL *eof; /* OUT end of file indicator */
-
-
- sm_ScanNextObject( ) reads the next object in the file and pins
- the object as if sm_ReadObject( ) were used. "ScanDesc" is the
- scan descriptor returned when the scan was opened. "Start" is
- the starting offset within the object to return.
-
- "Length" is the length of the object read to perform. If "length"
- is READ_ALL, the entire object is read (assuming that the size of
- the entire object is not greater than the amount of unpinned
- space in the buffer group). To obtain the object header and OID
- information for the object, use a "length" of zero.
-
- sm_ScanNextObject( ) returns a user descriptor for the object, if
- there is one to pin, whether or not any bytes are pinned. "Eof"
- is set to TRUE and "retDesc" is set to NULL when there are no
- more objects to be scanned. Each call to sm_ScanNextObject( )
- releases the user descriptor returned by the previous scan call,
- so sm_ReleaseObject( ) must not be used on user descriptors that
- are acquired by scanning files.
-
-
- sm_ScanNextBytes (scanDesc, length)
- SCANDESC *scanDesc; /* IN scan descriptor */
- int length; /* IN number of bytes to read */
-
-
- Sm_ScanNextBytes( ) is useful when a file being scanned contains
- very large objects that cannot be expected to fit in memory. A
- sm_ScanNextObject( ) call can be made with a relatively small
- length to read in the first section of an object.
- Sm_ScanNextBytes( ) is used subsequently to iterate over the rest
- of that object, with each call reading in the next "length" bytes
- of the current scan object. The iteration can be controlled by
- observing the objectSize field of the user descriptor. esmEN-
- DOFOBJECT is returned if there are no more bytes to be read in
- the current object.
-
-
- sm_CloseScan (scanDesc)
- SCANDESC *scanDesc; /* IN scan descriptor */
-
-
- Sm_CloseScan( ) closes the scan associated with "scanDesc". It
- releases the scan descriptor and the user descriptors and data
- pinned during the scan.
-
-
-
- 42
-
-
-
-
-
-
-
-
-
-
- 4.9.3. Bulk-loading Files
-
- WARNING: the file bulk load facility does not work properly in
- version 3.1. We recommend that it not be used.
-
-
- sm_OpenLoad (fid, type, groupSize, fillFactor, loadDesc)
- FID *fid; /* IN file to scan */
- int groupSize; /* IN size of load buffer group */
- float fillFactor; /* IN fill percentage */
- LOADDESC **loadDesc; /* OUT returned load descriptor */
-
-
- Sm_OpenLoad( ) prepares to load a set of objects into a file in
- bulk. Bulk loading a file can be more efficient than using a
- series of sm_CreateObject( ) calls. The file, indicated by
- "fid", need not be empty, in which case the new objects are
- added to the end of the file. The load mechanism creates and uses
- its own buffer group; the size of the buffer group is "group-
- Size". The "fillFactor" argument is ignored; it is present for
- future extensions. A load descriptor, "loadDesc" is returned for
- use in subsequent operations ( sm_LoadNextObject( ) and
- sm_CloseLoad( )).
-
-
- sm_LoadNextObject (loadDesc, length, data, oid)
- LOADDESC *loadDesc; /* IN load descriptor */
- int length; /* IN length of the object */
- void *data; /* IN the object's data */
- OID *oid; /* OUT returned new object id */
-
-
- Sm_LoadNextObject( ) creates a new object if size "length" in the
- file for which the "loadDesc" was opened. The new object is ini-
- tialized with "data". If "data" is NULL, the object is filled
- with zeroes. Sm_LoadNextObject( ) returns an object identifier
- for the new object in "oid".
-
-
- sm_CloseLoad (loadDesc)
- LOADDESC *loadDesc; /* IN load to close */
-
-
- Sm_CloseLoad( ) ends the bulk-load operation.
-
- 4.10. Operations on Indexes
-
- The Storage Manager's index facility associates keys with fixed-
- length elements. The keys can be any basic C data type (SM_int,
- SM_long, SM_short, SM_float, SM_double) or strings (SM_string).
- The size of the element is fixed when the index is created.
-
-
-
-
- 43
-
-
-
-
-
-
-
-
-
-
- B[+]tree index and linear hashing index functions are imple-
- mented. B[+]tree provides fast index lookup on all kinds of
- queries, especially range queries. Linear hashing provides even
- faster index lookup and supports linear space growth for dynami-
- cally growing indexes, but it supports only exact-match queries.
- More information about linear hashing can be found in [Litw88].
-
- A key is fully described by the KEY structure:
-
-
- typedef struct {
- TWO length; /* length of the key */
- void* valuePtr; /* pointer to value of the key */
- } KEY;
-
-
- Index keys are compared according to the key type given when the
- index is created. The key type determines the number of bytes
- considered in a key comparison. In the case of keys that are
- strings, the length fields in the keys in question determine the
- number of bytes compared. Strings are compared one character at
- a time. The client library does not terminate strings with
- nulls. When two strings of different lengths are compared, the
- shorter string is compared with the corresponding substring of
- the longer string. If the shorter string and the corresponding
- substring are equal, the longer string is considered to be the
- larger of the two. This means that "abc " is longer than "abc".
-
- Characters are compared as ASCII values.
-
- 4.10.1. Creating and Destroying Indexes
-
- When an index is created, the client library creates a handle, by
- which the index is identified in subsequent operations. The han-
- dle is an index identifier, a structure of type IID. The value
- of the index identifier can be treated as an opaque value by the
- application.
-
- The following macros can be used it give an illegitimate initial
- value to an index identifier, and later to recognize that value:
-
- INVALIDATE_IID (IID iid)
-
- sets "iid" to an invalid index identifier.
-
- IID_IS_INVALID (IID iid)
-
- returns TRUE if "iid" has the value given by INVALIDATE_IID( ),
- FALSE if not.
-
- The rest of this section describes the functions that operate on
- indexes.
-
-
-
- 44
-
-
-
-
-
-
-
-
-
-
- sm_CreateIndex(volume, groupIndex, ndxType, keyType, maxKeyLen, elSize, unique, ndx)
- VOLID volume; /* IN volume on which index is to be built */
- int groupIndex; /* IN the buffer group to use */
- SMTYPE ndxType; /* IN SM_BTREENDX, SM_HASHNDX, etc */
- SMDATATYPE keyType; /* IN SM_int, SM_long, SM_string, etc */
- int maxKeyLen; /* IN maximum key length of a key in the index */
- int elSize; /* IN element size (mpl of 4, < SM_MAXELEMLEN) */
- BOOL unique; /* IN TRUE if key is unique */
- IID* ndx; /* OUT returned index identifier */
-
-
- Sm_CreateIndex( ) creates an index that resides on "volume". [3]
- "NdxType" specifies the type of index (SM_BTREENDX for B[+]tree
- or SM_HASHNDX for linear hashing). "KeyType" indicates the data
- type of the key. The maximum length of a key in the index is
- given in "maxKeyLen". The size of the elements in the index is
- given in "elSize". The element size must be a multiple of four
- and less than SM_MAXELEMLEN (20). If "unique" is FALSE, the index
- is able to store multiple elements under the same key. An index
- identifier is returned in "ndx" upon successful completion.
-
-
- sm_DestroyIndex(ndx, groupIndex)
- IID* ndx; /* IN id of index to destroy */
- int groupIndex; /* IN which buffer group to use */
-
-
- Sm_DestroyIndex( ) destroys the index associated with "ndx".
-
-
- sm_SetLHashLoadThreshold(ndx, groupIndex, load)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- float loadFactor; /* IN the load factor to use for linear hashing */
-
-
- Sm_SetLHashLoadThreshold( ) changes the load factor for a linear
- hashing index from the default 75% to the given "loadFactor".
- The default load factor, 75%, yields the best access time and
- space utilization. See [Litw88] for information about linear
- hashing and when it might be useful to change the load factor.
- The load factor can be set only on a newly created index.
-
-
-
- ____________________
- [3] Indexes on temporary volumes are not implemented.
- (Section 5.1.3, Temporary Volumes). If the volume given is
- temporary, sm_CreateIndex( ) returns esmFAILURE, with error code
- esmNOTIMPLEMENTED.
-
-
-
-
-
- 45
-
-
-
-
-
-
-
-
-
-
- 4.10.2. Inserting and Removing Index Elements
-
-
- sm_InsertEntry(ndx, groupIndex, key, elem)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- KEY* key; /* IN key to insert */
- void* elem; /* IN element associated with key */
-
-
- Sm_InsertEntry( ) inserts a <key, elem> pair into the index
- "ndx". If "ndx" is a unique index and the key to be inserted
- already appears in the index, sm_InsertEntry( ) returns an error
- in sm_errno. If the index is not unique, there is no limit to
- the number of duplicate keys as long as different elements are
- associated with them.
-
-
- sm_RemoveEntry(ndx, groupIndex, key, elem)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- KEY* key; /* IN key to remove */
- void* elem; /* IN element associated with key */
-
-
- Sm_RemoveEntry( ) removes a <key, elem> pair from the index
- "ndx".
-
- 4.10.3. Loading Indexes in Bulk
-
- The Storage Manager provides a bulk-load facility for efficiently
- loading an empty index. When the application begins a bulk-load
- operation, the client library allocates a temporary run-buffer,
- which is used for sorting runs. Henceforth, the application uses
- sm_InsertEntry( ) repeatedly to load elements into index; no
- other index operations are allowed during a bulk-load. Each
- sm_InsertEntry( ) operation for the index inserts a <key, elem>
- pair into the temporary run buffer. The run buffer is sorted and
- written to the work file as a "sorted-run" when it is full. When
- the application terminates the bulk-load operation, the client
- library merges the sorted-runs into a sorted stream, from which
- the index is built from the bottom, up.
-
- Entries cannot be removed during a bulk-load operation.
-
-
-
- int sm_BeginIndexLoad(ndx, groupIndex, workVolume, runSize)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN the buffer group to use */
- VOLID workVolume; /* IN work volume */
- int runSize; /* IN size of each sorted run in pages */
-
-
-
- 46
-
-
-
-
-
-
-
-
-
-
- Sm_BeginIndexLoad( ) prepares to load the index given in "ndx",
- using the buffer group "groupIndex". Sm_BeginIndexLoad( ) uses
- the volume named by "workVolume" for the sorted runs. Using a
- temporary volume for the work volume yields the best performance
- (see Section 5.1.3, Temporary Volumes).
-
- The "runSize" argument determines how many MIN_PAGESIZE pages to
- fill before ending a run. The larger "runSize", the more memory
- is consumed by the bulk-load, with a commensurate improvement in
- speed. Sm_BeginIndexLoad( ), if it is used, must be the first
- operation performed on an index.
-
-
-
- int sm_EndIndexLoad(ndx)
- IID* ndx; /* IN index identifier */
-
- Sm_EndIndexLoad( ) concludes the bulk-load and builds the index.
-
-
- int sm_AbortIndexLoad(ndx)
- IID* ndx; /* IN index identifier */
-
- sm_AbortIndexLoad( ) aborts the bulk-loading of an index. All
- resources used by the index are freed.
-
- 4.10.4. Scanning Indexes
-
- Indexes are used by posing queries with the sm_FetchInit( )
- operation. A query requests all the elements whose key values lie
- in a range. The results of the query are fetched, one element at
- a time, with the sm_FetchNext( ) operation. An index scan uses a
- cursor, a value of the type SMCURSOR. A cursor can be treated by
- the application as an opaque value. The following two macros
- give a cursor an invalid initial value and recognize that value:
-
- INVALIDATE_CURSOR (SMCURSOR cursor)
-
- sets "cursor" to an invalid index scan cursor.
-
- CURSOR_IS_INVALID (SMCURSOR cursor)
-
- returns TRUE if "cursor" is the value given by
- INVALIDATE_CURSOR( ), FALSE if not.
-
- The rest of this section describes the functions used to scan
- indexes.
-
-
-
-
-
-
-
-
- 47
-
-
-
-
-
-
-
-
-
-
- sm_FetchInit(ndx, groupIndex, bound1, cond1, bound2, cond2, cursor)
- IID* ndx; /* IN index identifier */
- int groupIndex; /* IN which buffer group to use */
- KEY* bound1; /* IN starting bound of the scan */
- SMCOND cond1; /* IN starting condition */
- KEY* bound2; /* IN ending bound of the scan */
- SMCOND cond2; /* IN ending condition */
- SMCURSOR* cursor; /* OUT returned pointer if non-NULL */
-
-
- Sm_FetchInit( ) begins a scan on the index "ndx". The arguments
- "bound1" and "cond1" specify the beginning search condition.
- "Bound2" and "cond2" specify the ending search condition. The
- conditions can be SM_EQ, SM_G, SM_L, SM_GEQ, or SM_LEQ. The
- "cursor" argument is initialized by sm_FetchInit( ) and used by
- sm_FetchNext( ). The caller is responsible for allocating the
- space for the cursor and the client library is responsible for
- the value of the cursor.
-
- The direction of the scan (ascending or descending) is determined
- by the bounds and conditions of the query. The beginning and end
- of an index are specified with the macros SM_BOF and SM_EOF. For
- linear hashing indexes (type SM_HASHNDX), the value that can be
- used for "cond1" and "cond2" is SM_EQ.
-
- Several examples of queries follow:
-
- (1) Scan from key1 = "10" to key2 = "30" inclusively:
- sm_FetchInit( ..., key1, SM_GEQ, key2, SM_LEQ, cursor) ---
- ascending
- sm_FetchInit( ..., key2, SM_LEQ, key1, SM_GEQ, cursor) ---
- descending
-
-
- (2) Scan from key1 = "10" to the end of the index:
- sm_FetchInit( ..., key1, SM_GEQ, SM_EOF, cursor) ---
- ascending
- sm_FetchInit( ..., SM_EOF, key1, SM_GEQ, cursor) --- des-
- cending
-
-
- (3) Scan the whole index:
- sm_FetchInit( ..., SM_BOF, SM_EOF, cursor) --- ascending
- sm_FetchInit( ..., SM_EOF, SM_BOF, cursor) --- descending
-
- sm_FetchNext(cursor, retKey, retElem, eof)
- SMCURSOR* cursor; /* IN cursor from sm_Fetch( ) */
- KEY* retKey; /* OUT returned key (optional) */
- void* retElem; /* OUT elem */
- BOOL* eof; /* OUT to TRUE if EOF reached */
-
-
-
-
-
- 48
-
-
-
-
-
-
-
-
-
-
- Sm_FetchNext( ) fetches the next element returned by a query.
- The element is returned in the structure addressed by "retElem".
- A copy of the key can also be returned to the caller. If "ret-
- Key" is NULL, no key is returned. If "retKey" points to a key
- structure, the key is returned in that structure. The "length"
- field in the key structure must indicate amount of space avail-
- able in the target of the "valuePtr" field. This must be enough
- for the longest key in the index. The caller is responsible for
- allocating space for "retKey" and "retElem".
-
- sm_FetchNext( ) returns FALSE in "eof" if an element is returned.
- If there are no more elements that satisfy the query, TRUE is
- returned in "eof".
-
-
- 4.11. Advanced Topics
-
- 4.11.1. External Two-Phase Commit Functions
-
- The Storage Manager can particpate in transactions coordinated by
- other software modules that employ the two-phase commit
- "presumed abort" transaction semantics and protocol. (For the
- purpose of this section, the reader is assumed to be familiar
- with the "presumed abort" protocol.) The coordinator in such a
- situation is external to the Storage Manager; it is assumed to
- have its own stable storage, and it is assumed to recover from
- failures in a short time (the precise meaning of which is given
- forthwith).
-
- A prepared transaction, like an active transaction, consumes log
- space on one or more Exodus servers, beginning at a fixed loca-
- tion in each log. A Storage Manager server's log is like a cir-
- cular buffer; it wraps and reuses the beginning of the log. If
- long-running or prepared transactions are still in the system,
- the server eventually tries to re-use log space consumed by the
- oldest transaction, at which point it effectively runs out of log
- space. A coordinator must resolve its prepared transactions
- before the servers run out of log space. The amount of time
- involved is a function of the size of the log on the participat-
- ing servers and the load on those servers.
-
- For the purpose of this discussion, the portion of a global tran-
- saction that involves a single Exodus Storage Manager transaction
- is called a thread of the global transaction. Each thread has,
- in addition to its local transaction identifier, a global tran-
- saction identifier. Global transaction identifiers are provided
- by the application or some external authority, and must be
- unique. A global transaction identifier has type GTID, defined
- in sm_client.h, as follows:
-
-
-
-
-
-
- 49
-
-
-
-
-
-
-
-
-
-
- #define MAXOPAQUELEN 255
- typedef struct {
- int length; /* maximum MAXOPAQUELEN bytes */
- u_char opaque[MAXOPAQUELEN];
- } GTID;
-
-
- The Storage Manager does not interpret the contents of the opaque
- part of the global transaction identifier.
-
- An application that invokes the external two-phase commit proto-
- col can find itself in any of the transaction states mentioned in
- Section 4.3.2 ("Transaction States"). It can also find itself in
- the PREPARED state after a call to sm_PrepareTransaction( ). An
- application in PREPARED state calls sm_CommitTransaction( ) or
- sm_AbortTransaction( ) to complete the transaction and return to
- the INACTIVE state.
-
- While the coordinator for a global transaction is external to the
- Storage Manager, a single Storage Manager server corresponds with
- the client library and coordinates the Storage Manager servers
- that participate in the thread. If the application should crash
- during a two-phase commit, a new application program (represent-
- ing the global coordinator) must run, and it must contact the
- Storage Manager that is acting as the thread's coordinator. In
- order to locate the proper server, a two-phase commit process
- begins by informing the client library that a transaction is a
- thread of a global transaction, and by identifying the thread's
- coordinator. The function sm_Enter2PC( ), described below,
- accomplishes this.
-
-
- sm_Enter2PC (tid, gtid, handle)
- TID tid; /* IN transaction ID */
- GTID *gtid; /* IN global transaction ID */
- COORD_HANDLE *handle; /* OUT for use if client crashes */
-
-
-
- The application supplies the local and global transaction iden-
- tifiers. The client library identifies a thread coordinator, and
- produces a handle for the application to write to stable storage.
- The handle identifies the thread coordinator; it is used by
- sm_Recover2PC( ) if the client crashes before the two-phase com-
- mit is completed.
-
- The handle must be written to stable storage before the first
- phase of the commit begins, otherwise the application and Storage
- Manager may not be able to recover from a subsequent application
- failure.
-
-
-
-
-
- 50
-
-
-
-
-
-
-
-
-
-
- sm_PrepareTransaction (tid, vote)
- TID tid; /* IN transaction ID */
- VOTE *vote; /* OUT result of first phase */
-
-
-
- The application calls sm_PrepareTransaction( ) to begin the
- first, or prepare, phase of a two-phase commit.
- sm_PrepareTransaction( ) determines if the participating servers
- are able to commit the transaction, and directs them to prepare
- to commit if they are. If any of the participating servers is
- unable to commit the transaction, the vote returned is NOVOTE,
- sm_PrepareTransaction( ) sets sm_error to esmTRANSABORTED,
- sm_reason to esmTRANSNOTPREPARED, and returns esmFAILURE; the
- application must call sm_AbortTransaction( ).
-
- If all participating servers are able to commit, and any of them
- logged updates during the transaction, the vote is YESVOTE, and
- the transaction state becomes PREPARED. If the transaction did
- not update any data on any of the servers, the vote is READVOTE,
- and the transaction state becomes INACTIVE.
- Sm_PrepareTransaction( ) returns esmNOERROR if the transaction
- becomes prepared (all servers vote YESVOTE) or committed (all
- server vote READVOTE).
-
- If an error occurs during the prepare phase,
- sm_PrepareTransaction( ) returns esmFAILURE. If it is a recover-
- able error, the client library returns an error code specific to
- the error in sm_errno (such as esmTRANSDISABLED if a server is
- performing recovery), and the application can try again to call
- sm_PrepareTransaction( ). Some errors, on the other hand, cause
- the transaction to be aborted, in which case
- sm_PrepareTransaction( ) returns esmTRANSABORTED in sm_errno, and
- a vote of NOVOTE.
-
- If an application crashes during the first phase, the application
- must retry the prepare phase and complete the transaction. If it
- does not retry the prepare phase, and the transaction was indeed
- prepared before the application crashed, the prepared transaction
- consumes resources indefinitely, and eventually its servers will
- run out of log space.
-
- Once a transaction is prepared, an application must invoke the
- second phase by aborting or committing the transaction (calling
- sm_AbortTransaction( ) or sm_CommitTransaction( ), respectively).
- It is an error to commit a global transaction thread without
- first preparing the transaction, and it is an error to do any-
- thing else without completing the second phase.
-
- When an error occurs during the second phase, the application
- cannot tell if the second phase completed (the transaction indeed
- committed or aborted). It is alway safe to try again to complete
-
-
-
- 51
-
-
-
-
-
-
-
-
-
-
- the transaction by calling sm_AbortTransaction( ) or
- sm_CommitTransaction( ) again.
-
- If the second phase fails because the network connection between
- the client and the thread coordinator breaks (esmSERVERDIED or
- esmNOTCONNECTED), the client must reconnect to the thread coordi-
- nator before the second phase can be finished. The following
- function does that:
-
-
- sm_Continue2PC (tid, willing2block)
- TID tid; /* IN transaction ID */
- BOOL willling2block; /* IN ok to block indefinitely */
-
-
- If "willing2block" is TRUE, the client library blocks until it
- connects to the thread coordinator. If this is inappropriate for
- the application, "willing2block" must be FALSE, and the client
- library tries once to contact the thread coordinator.
-
- If the application crashes, its replacement must use
- sm_Recover2PC( ), below, instead of sm_Continue2PC( ) to resolve
- the transaction.
-
-
- sm_Recover2PC (gtid, handle, willing2block, tid)
- COORD_HANDLE *handle; /* IN handle for thread coordinator */
- GTID *gtid; /* IN global transaction ID */
- BOOL willing2block; /* IN ok to block indefinitely */
- TID *tid; /* OUT local transaction ID */
-
-
- When the application crashes (exits) after a transaction is
- prepared but before its second phase is completed, a "recovery"
- application program must be run within a short time to finish the
- two-phase commit and resolve the transaction. This recovery
- application must use sm_Recover2PC( ), supplying the global tran-
- saction identifier and the handle returned by sm_Enter2PC( ) for
- that global transaction. The client library contacts the server
- identified in the handle, which conveys to the client library all
- that is needed for the application to enter or to retry the
- second phase. The transaction's local transaction identifier is
- returned by sm_Recover2PC( ) for the application to use in its
- subsequent call to sm_CommitTransaction( ) or
- sm_AbortTransaction( ).
-
- The thread coordinator may not be available, in which case the
- client library keeps trying to connect or it will return an error
- (such as ECONNREFUSED), depending on the value of
- "willing2block". If "willing2block" is FALSE, the client library
- tries only once to connect the thread coordinator.
-
-
-
-
- 52
-
-
-
-
-
-
-
-
-
-
- 4.11.2. Administrative Operations
-
- The following functions can be applied to one or more servers.
- Each function takes two arguments that determine which servers
- are of interest. The first argument is of type FLAGS, and takes
- one of the following values:
-
-
- VOL_ALL /* the servers for all volumes */
- VOL_USED_SINCE_INIT /* servers for all volumes used */
- VOL_USED_IN_TRANSACTION /* servers used in this transaction */
- VOL_BY_VOLID /* the second argument applies */
-
- The client library keeps a list of volumes and the servers that
- manage those volumes. The list is created from the information
- given in the configuration files and information passed to the
- library through sm_SetClientOption( ), The flag VOL_ALL directs
- the client library to apply the administrative operation to the
- server that manages each volume in its list of known volumes.
- The flag VOL_USED_SINCE_INIT directs the client library to apply
- the administrative operation to each server contacted since
- sm_Initialize( ) was called. The flag VOL_USED_IN_TRANSACTION
- directs the client library to apply the administrative operation
- to each server contacted so far for participation in the current
- transaction. (It does not apply to servers to be contacted for
- the first time later in the transaction.) The flag VOL_BY_VOLID
- directs the client library to apply the administrative operation
- to the server that manages the volume identified by the second
- argument. The second argument is a volume identifier VOLID,
- which is ignored when the flags argument is VOL_ALL,
- VOL_USED_SINCE_INIT, or VOL_USED_IN_TRANSACTION.
-
- Ideally the administrative operations would only be performed by
- trusted clients, but the Storage Manager does not restrict their
- use.
-
-
- sm_TakeCheckpoint (flags, volid, numCheckpoints)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- short numCheckpoints; /* IN number of checkpoints to take */
-
-
- Sm_TakeCheckpoint( ) sends a request to the server to take a
- number of checkpoints. In most circumstances, a value of one for
- the "numCheckpoints" argument is appropriate. A value greater
- than 1 can be used to ensure that the server flushes all pages
- that were dirty when the first checkpoint was taken. (This is
- useful for experimenting with the recovery facility).
-
-
-
-
-
-
- 53
-
-
-
-
-
-
-
-
-
-
- sm_ChangeCheckpointFrequency (flags, volid, frequency)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- int frequency; /* IN number of log records between checkpoints */
-
-
- Sm_ChangeCheckpointFrequency( ) changes the frequency of check-
- points taken by the server. The checkpoint frequency is based on
- the number of log pages written. More information about check-
- point frequency can be found in Section 5.3, Tuning the Server.
-
-
- sm_ShutdownServer (flags, volid, options)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- FLAGS options; /* IN shutdown options */
-
-
- Sm_ShutdownServer( ) directs servers to shut down. The "options"
- argument indicates what a server should do before exiting. The
- following flags are available: NOFLAGS, SHUT_TAKE_CHECKPOINT,
- SHUT_DUMP_CORE, SHUT_ABORT_TRANS, SHUT_COMMIT_TRANS,
- SHUT_CLEAN_VOLUMES. These can be combined with the logical "or"
- operator.
-
- If NOFLAGS is given, the server kills the disk processes and
- exits.
-
- SHUT_TAKE_CHECKPOINT directs the server to take a checkpoint
- before exiting.
-
- SHUT_DUMP_CORE directs the server to dump a core file debugging
- (see core(5)).
-
- SHUT_COMMIT_TRANS directs the server to wait until the running
- transactions either commit or abort before it shuts down.
-
- SHUT_ABORT_TRANS directs the server to abort all running transac-
- tions before shutting down. When SHUT_COMMIT_TRANS or
- SHUT_ABORT_TRANS is used, clients cannot start any new transac-
- tions.
-
- SHUT_CLEAN_VOLUMES directs the server to write dirty pages to
- disk before exiting. To shut down a server after which recovery
- is not required, use either SHUT_COMMIT_TRANS |
- SHUT_CLEAN_VOLUMES or SHUT_ABORT_TRANS | SHUT_CLEAN_VOLUMES.
-
-
-
-
-
-
-
-
-
- 54
-
-
-
-
-
-
-
-
-
-
- sm_ServerStatistics (flags, volid, numServers, stats, reset)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which server is of interest */
- int *numServers; /* OUT # servers contacted */
- SERVERSTATS **stats; /* OUT servers' statistics */
- BOOL reset; /* IN TRUE = reinitialize counters */
-
-
- Sm_ServerStatistics( ) obtains statistics about one or more
- servers. For each server contacted, a set of statistics is
- returned. The client library allocates space for the statistics,
- and the application is responsible for freeing that space ( see
- the manual page for malloc(3) ). The "flags" indicate which
- servers are of interest, and the number of servers contacted is
- returned in "*numServers". On return from
- sm_ServerStatistics( ), the "*stats" pointer addresses an array
- of "*numServers" SERVERSTATS structures. This array must be
- freed by the application with one call to free(3).
-
- If "reset" is TRUE, the statistics labeled as counters below are
- reset to zero.
-
- The SERVERSTATS structure looks like this:
-
- typedef struct {
- int numClients; /* # clients connected */
- int numTrans; /* # transactions in progress */
- int numVolumes; /* # volumes mounted */
- int freeLogSpace; /* approximate # bytes free log space */
- int chpntFreq; /* checkpoint frequency */
- int totalCommits; /* # transactions committed */
- int totalAborts; /* # transactions aborted */
- int diskReads; /* # disk reads */
- int diskWrites; /* # disk writes */
- MESSAGESTATS msgStats; /* server's message counters */
- } SERVERSTATS;
-
-
- The MESSAGESTATS structure contains statistics about the client-
- server protocol and the server-server protocol. A set of these
- statistics is kept by the client library a set is kept by each
- server. The client library's statistics are found in the global
- structure
-
- extern MESSAGESTATS MsgStats;
-
- The MESSAGESTATS structure contains the following counters for
- each message type: messages sent, messages received, replies
- received with an error indication, replies received with no
- error, messages sent with no reply requested. The counters for
- replies have two different meanings, depending on which set
- statistics is concerned. The servers count the replies sent with
-
-
-
- 55
-
-
-
-
-
-
-
-
-
-
- and without error indications, and the number of requests that
- the server received that did not require a reply at all. The
- client library counts the replies received with and without error
- indications, and the number of requests that the client sent that
- did not require a reply at all.
-
- The following function prints the MESSAGESTATS structure:
-
-
- sm_PrintMessageStats (file, stats)
- FILE *const file; /* IN where to print */
- MESSAGESTATS *const msgStats; /* IN what to print */
-
-
- The following function tells if a mounted volume is temporary
- volume, a data volume, or a log volume. See Section 5.1, Manag-
- ing Volumes, for information about volumes.
-
-
- sm_VolumeProperties (volid, properties)
- VOLID volid; /* IN which volume is of interest */
- int *properties; /* OUT the properties */
-
-
- Sm_VolumeProperties( ) returns a set of bits that tell whether
- the given volume is a data volume or a temporary volume. The
- "volid" argument is the volume identifier of the volume in ques-
- tion. If the volume is not mounted when Sm_VolumeProperties( )
- is called, Sm_VolumeProperties( ) mounts it.
-
- VOLPROP_TEMP indicates that the volume is temporary (see Section
- 5.1.3, Temporary Volumes). If the bit VOLPROP_TEMP is not set in
- the result, the volume is a data volume. A log volume cannot be
- mounted by a client, and an attempt to get a log volume's proper-
- ties results in an error.
-
-
- sm_AddServerVolume (flags, volid, option, value)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which volume is of interest */
- char *option; /* IN which format option to use */
- char *value; /* IN value for the format option */
-
-
- Sm_AddServerVolume( ) adds a volume to the list of mountable
- volumes on one or more servers (although it seldom makes sense to
- do this on more than one server with a single pair of arguments).
- The "flags" argument indicates which servers are of interest.
- The "volid" argument is the volume identifier of the volume that
- will determine which server to contact when "flags" ==
- VOL_BY_VOLID. The "option" is one of the server's format options
- ("dataformat" or "tempformat"). The "value" argument is the
-
-
-
- 56
-
-
-
-
-
-
-
-
-
-
- value to be given the option named in "option".
-
- Sm_AddServerVolume( ) adds the named volume to the server's list
- of known volumes, but the server does not try to mount the volume
- or verify that the volume exists or is valid.
- Sm_AddServerVolume( ) fails if the value given conflicts with
- another volume already in the server's table, either in the path
- name or the volume identifier. If your objective is to change
- the format information for a path name that is in the server's
- table, first remove the existing format information (using
- sm_RemoveServerVolume( ), described below), and subsequently add
- the new information.
-
-
- sm_RemoveServerVolume (flags, volid, volid2remove)
- FLAGS flags; /* IN which servers are of interest */
- VOLID volid; /* IN which volume id of server of interest */
- VOLID volid2remove; /* IN which volume to remove */
-
-
- Sm_RemoveServerVolume( ) removes "volid2remove" from one or more
- servers' lists of mountable volumes. The volume cannot be
- removed from a server's table while the volume is in use. it must
- be dismounted before it is removed.
-
- See also Section 5.1, Managing Volumes.
-
- 4.11.3. Tuning the Application
-
- The size of the application's buffer pool, determined by the
- "bufpages" option, is the primary tuning parameter that is under
- the control of applications. The "bufpages" option indicates the
- number of MIN_PAGESIZE pages in the buffer pool. It should be set
- large enough to hold the application's working set of objects.
- The buffer pool must not exceed the size of physical memory
- available to the client.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 57
-
-
-
-
-
-
-
-
-
-
- 5. USING STORAGE MANAGER SERVERS
-
- Storage Manager servers provide disk, file, transaction, con-
- currency control, and recovery services to clients. In most
- respects, users do not have to understand how servers work, but
- there are a few things that administrators should know; we focus
- on those things in this section. The first half of this section
- explains how to manage volumes. The second half explains how to
- operate a server.
-
-
- 5.1. Managing Volumes
-
- Servers store data on volumes, which can be Unix files or raw
- disk partitions. Each server is composed of a server process and
- one disk process for each mounted volume. When a server requires
- I/O, it asks the appropriate disk process to read from or write
- to the server's buffer pool, which is located in a Unix System V
- shared-memory segment. The disk processes perform I/O so that the
- server never blocks when I/O is required. The server mounts a
- volume before using it, and the server dismounts the volume when
- it is no longer in use. Mounting a volume consists in forking a
- disk process for that volume. Dismounting the volume consists in
- flushing all dirty pages to the disk and killing the volume's
- disk process.
-
- Volumes are created with the formatvol program, which establishes
- a volume's identifier, size, type, and other characteristics.
- Volumes come in three types: log volumes, data volumes, and tem-
- porary volumes.
-
- 5.1.1. Log Volumes
-
- Log volumes are used to store log information for aborting tran-
- sactions and for recovery. The server has one log volume mounted
- at all times.
-
- 5.1.2. Data Volumes
-
- Data volumes are used to store objects and indexes that are meant
- to exist after a transaction ends. Changes to data volumes are
- logged so that transactions can be aborted or committed with
- reliability, and so that recovery can be performed after a crash.
-
- 5.1.3. Temporary Volumes
-
- Some applications store temporary private data and do not need
- concurrency control or recovery. The Storage Manager provides
- temporary volumes for this purpose. Locks are not acquired for
- data in temporary volumes, and updates to temporary volumes are
- not logged. Temporary volumes are less costly to use than data
- volumes are, but the data on them cannot be shared among
-
-
-
- 58
-
-
-
-
-
-
-
-
-
-
- transactions. The data on temporary volumes are deleted at the
- conclusion of the transaction that creates them, regardless of
- whether the transaction is committed or aborted. Temporary
- volumes cannot contain root entries.
-
- The server can serve many data volumes and temporary volumes
- simultaneously.
-
- 5.1.4. Raw Partitions and Unix Files
-
- A volume can be a Unix file or a Unix raw partition. When a raw
- partition is used, data are transferred between the server's
- buffer pool and the disk by the disk process, bypassing the Unix
- file system's buffer pool.
-
- When a Unix file is used, the data are written to the Unix file
- system's buffer pool, and the operating system worries about
- flushing the data to the disk. In this case, the server forces
- the data to the disk periodically with a Unix fsync( ) system
- call.
-
- 5.1.5. Formatting Volumes
-
- Before a volume can be used, it must be formatted. This is done
- using the formatvol program, which can also display information
- about previously formatted volumes. Formatvol uses the confi-
- guration options "dataformat", "tempformat", and "logformat" to
- determine what characteristics to give volumes that it formats.
- The options have values that list the following information:
-
- path The Unix path name of the volume, e.g., /dev/rz2c.
-
- volid The volume identifier for this volume, an integer,
- e.g., 8000.
-
- #cyl The number of cylinders on this disk, e.g., 1224 for a
- DEC RZ55. May be 1.
-
- #trk/cyl The number of tracks per cylinder e.g., 15 for a DEC
- RZ55. May be 1.
-
- #sect/trk The number of sectors or blocks per track e.g., 36 for
- a DEC RZ55. May be the number of blocks in the file.
- A block is MIN_PAGESIZE bytes; MIN_PAGESIZE is defined
- in sm_client.h. (This is determined by the Storage
- Manager, not by the device.) [4]
-
- #KB/pg For logformat only. This gives the page size for log
- pages, in kilobytes. The value given here may be 4 or
- larger, and must be a power of 2.
-
- Formatvol collects the format information from the options in the
- configuration files, after which it determines which volumes to
- format or to display by processing the options "volume" and
- "display" from the command line. The options that formatvol
- understands are summarized in Table 2.
-
-
-
-
-
-
-
-
-
-
-
- _________________________________________________________________________________
- Option Option Option
- Name Type Description
- _________________________________________________________________________________
- tempformat string,int,int,int path,volid,#cyl,#trk/cyl,#sect/trk
- dataformat string,int,int,int path,volid,#cyl,#trk/cyl,#sect/trk
- logformat string,int,int,int,int path,volid,#cyl,#trk/cyl,#sect/trk,#KB/pg
- volume int volume to format - command line only
- display int volume to display - command line only
- _________________________________________________________________________________
- |
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
- Table 2: Formatvol Options.
- Fields are separated by white space, commas, colons or semicolons.
-
-
- For example, to print information about the volumes with volids
- 8000 and 4000 use:
-
- formatvol -dis 8000 -dis 4000
-
-
- To format a data volume with volid 8000 and a temporary volume
- with volid 4000 use:
-
- formatvol -vol 8000 -vol 4000
-
-
- Formatting a volume writes a volume header and initializes the
- bitmaps that describe the free blocks on the volume. A volume
- that is reformatted after being used loses all its data.
-
- The Storage Manager does not prevent a volume from being format-
- ted while it is in use by a server, even though it will cause the
- server to crash unrecoverably. Be certain that a volume is not
- mounted before you format it! [5] A volume is unmounted when all
- clients that are using the volume have completed transactions on
- it and have unmounted it. (A client may unmount a volume expli-
- citly with sm_DismountVolume( ), or by shutting down with
- sm_ShutDown( ) or exit( ).)
-
- During recovery, a server mounts the volumes that need recovery.
- The volumes are dismounted when recovery is completed. If a
- volume was in use at the time its server crashed, do not reformat
- the volume until a new server recovers the data on that volume.
- If you do, the server's log will be inconsistent with the data on
- ____________________
- [4] The format of a volume does not affect performance with
- most modern disks. The easiest way to format volumes it to use
- use 1 cyl, 1 track/cyl, and let the sect/trk account for the size
- of the entire volume.
- ____________________
-
-
-
- 60
-
-
-
-
-
-
-
-
-
-
- the volume, and the server will crash during recovery, and it
- will be unable to recover from that. You can reformat the data
- volumes and the log volume to get a server running again, but you
- will have lost all data on the volumes.
-
- The log volume is mounted whenever the server is running, so a
- log volume can be formatted ONLY when the server is not running.
-
- 5.1.6. Size Requirements for Log Volumes
-
- How large should a log volume be? The answer depends on the
- expected transaction mix. More specifically, it depends on the
- age of the oldest (longest running) transaction in the system and
- the amount of log space used by all active transactions. Here are
- some general rules to determine the amount of free log space
- available in the system.
-
- (1) The physical log is circular. Log space between the first
- log record generated by the oldest active transaction and
- the most recent log record generated by any transaction
- cannot be reused.
-
- (2) Log space for a transaction is available for reuse when
- the transaction has committed or completely aborted.
- Aborting a transaction causes log space to be used, so
- space is reserved for aborting each transaction. Enough
- log space must be available to commit or abort all active
- transactions at all times.
-
- (3) Only space starting at the beginning of the log can be
- reused. This space can be reused if it contains log
- records only for transactions meeting rule 2.
-
- (4) All sm_WriteObject( ) calls require log space twice the
- size of the space written in the object. All calls that
- create, grow, or shrink objects require log space equal to
- the size created, inserted, or deleted. Log records gen-
- erated by these calls (generally one per call) have an
- overhead of approximately 50 bytes.
-
- (5) File operations are logged, but the space requirements for
- them are most often negligible, since they are relatively
- rare operations, and are often performed in short transac-
- tions.
-
-
- ____________________
- [5] The Storage Manager ought to lock volumes with Unix file
- locks, but Unix does not provide an adequate mechanism for
- locking and unlocking files in the context of crash recovery.
-
-
-
-
-
- 61
-
-
-
-
-
-
-
-
-
-
- (6) The amount of log space reserved for aborting a transac-
- tion is equal to the amount of log space generated by the
- transaction (for the purpose of committing the transac-
- tion).
-
- (7) When insufficient log space is available for a transac-
- tion, the transaction is aborted.
-
- (8) The log should be at least 1 Mbyte (250 pages).
-
- For example, consider a transaction T1, which creates 300 objects
- of size 2,000 bytes, writes 20 bytes in 100 objects, and is com-
- mitted. T1 requires at 615 Kbytes for the creates and 9 Kbytes of
- log space for the writes. Since log space must be reserved to
- abort the transaction, the log size must be over 1.248 Mbytes to
- run this transaction. Assuming T1 is the only transaction running
- in the system, all the log space it uses and reserves becomes
- available when it completes. If another transaction, T2, is
- started at the same time as T1, but is still running after T1 is
- committed, only the reserved space for T1 is available for other
- transactions. The portion of the log used by T1 and T2 is not
- available until T2 is finished.
-
- Transactions that fail because of insufficient log space are com-
- monly those that load a large number of objects into a file dur-
- ing the creation of a database. A solution to this problem is to
- load the file in a series of smaller transactions. When the last
- transaction is committed, the load is complete. If the load needs
- to be aborted, a separate transaction is run to destroy the file.
-
- 5.1.7. Backing Up Volumes
-
- The Storage Manager does not support media recovery, so backing
- up critical data volumes is wise. A volume may be backed up when
- it is unmounted and needs no recovery. If a volume is stored on
- a Unix file, a simple copy of the file can be used as a backup.
- For volumes stored on a raw disk partition, the Unix dd(1) com-
- mand can be used to backup the volume to a Unix file and to
- restore it. For example, to save a copy of the raw device
- /dev/rrz4d in the Unix file backup.rrz4d use:
-
- dd if=/dev/rrz4d of=backup.rrz4d.
-
- To restore the backup, use:
-
- dd if=backup.rrz4d of=/dev/rrz4d.
-
-
-
-
-
-
-
-
-
- 62
-
-
-
-
-
-
-
-
-
-
- 5.2. Using the Server
-
- In this section we explain how to operate a Storage Manager
- server. For the purpose of this discussion, we use only one
- server, although any number of servers can be used to manage any
- number of volumes. We begin with starting and configuring the
- server. Next, we discuss what the server does during normal
- operation. We follow this with instructions for shutting the
- server down. Finally, we explain how the server recovers from
- failure.
-
- 5.2.1. Starting the Server
-
- The server is composed of two executable files: sm_server and
- diskrw. Sm_server is the main server program. Diskrw is started
- by the server, as a separate process for each mounted volume, for
- performing asynchronous disk I/O. These processes communicate
- with the server through sockets, semaphores, and shared memory.
- By default, the server assumes diskrw is located in the user's
- path. An option, described below, can be used to change this
- assumption.
-
- When the server is started, it processes configuration options.
- These options are discussed further below. Second, the server
- allocates the buffer pool. The buffer pool is located in shared
- memory, so the operating system must have shared-memory support.
- Furthermore, the machine on which the server runs must have
- enough shared memory to accommodate the entire buffer pool. If
- not enough shared memory is available, the server prints a mes-
- sage, indicating how much shared memory it is trying to acquire,
- and exits.
-
- Third, the server mounts the log volume. If the log volume is
- newly formatted, it is regenerated. When a log volume is regen-
- erated, the entire log is cleared and written to disk. This will
- take noticeable time if the volume is large. If the log is not
- regenerated, recovery analysis is performed.
-
- If no volumes require recovery, all phases of recovery complete
- in less than one second. If the analysis determines that any
- volumes require recovery (due to a previous failure of some sort:
- operating system failure, machine failure, internal error, or
- because a user killed the server), recovery is performed. Data
- volumes that were mounted at the time of the failure are
- remounted, updates by committed transactions are restored, and
- all transactions in progress at the time of failure are aborted.
- When recovery is complete, the data volumes are dismounted and a
- checkpoint is taken.
-
- The server now begin to process requests from clients.
-
-
-
-
-
- 63
-
-
-
-
-
-
-
-
-
-
- 5.2.2. Configuring the Server
-
- There are several configuration options that can be set when the
- server is started. A brief description of the options is given in
- Table 3. Most options have default values, but some do not, and
- these must be given values, either on the command line or in a
- configuration file. See Section 3 for general information that
- applies to all options.
-
- Option values are read from the the default configuration files
- /usr/lib/sm_config, $HOME/.sm_config, and ./.sm_config in that
- order, if they exist. If the command-line option "skipdefault" is
- given, these default files are not read.
-
- Options on the command line are read after the default files are
- read. Command-line options are prefixed by a "-". In addition to
- options, a server accepts the command-line flags given in Table
- 4. Command-line flags are prefixed by a "-".
-
- When given the "help" flag, a server prints a list of the avail-
- able options and flags, and exits.
-
- The "skipdefault" flag prevents a server from reading the default
- configuration files. It must be the first argument on the command
- line if it is used.
-
- The "force" flag prevents a server from checking with the user
- before regenerating the log.
-
- The "background" flag causes the server to disconnect from its
- controlling terminal. This flag is available for users who run
- the server from shells that, like the Bourne shell, do not have
- real job control.
-
- We now describe each option from Table 2.
-
- The "config" option specifies a configuration file to read after
- default configuration files have been read. This option is
- effective only on the command line.
-
- The "verbose" option is used to turn on and off printing of the
- option values at startup. Options are printed to the file speci-
- fied by "errorfile" option (q.v.).
-
- The "bufpages" option indicates the number of MIN_PAGESIZE pages
- to be used for a server's buffer pool. The option must be given
- for a server to run. This option determines the size of the
- shared memory segment allocated by the server. The shared memory
- segment will be MIN_PAGESIZE*bufpages bytes long plus a few KB
- extra. Section 5.3, Tuning the Server, for more information
- about setting this option.
-
-
-
-
- 64
-
-
-
-
-
-
-
-
-
-
-
-
- _______________________________________________________________________________________________________
- Option Option Possible Default Option
- Name Type Values Values Description
- _______________________________________________________________________________________________________
- config string file name /usr/lib/sm_config read a configuration file
- $HOME/.sm_config defaults is read unless
- ./.sm_config skipdefault is set
- verbose Boolean yes no no print configuration options
- bufpages int > 32 none number of buffer pool pages
- logvolume string path name none name of the log volume
- portname string name or number exodussm port name or port number
- for a server; if a name, it
- must be in /etc/services
- errorfile string file name - (stderr) file for errors,
- warnings, progress
- regenlog Boolean yes no no clear the log,
- shutdown Boolean yes no no shut down after recovery
- or regeneration of log
- checkpoints int > 1 100 checkpoint frequency
- (based on number of log pages)
- diskproc string file name /usr/lib/exodus/diskrw disk I/O program name
- intercache Boolean yes no yes allow caching of pages
- at the client between
- transactions
- progress Boolean yes no no control progress printing
- maxclients int > 0 20 maximum number of
- clients to be served
- simultaneously
- maxthreads int > 1 function(maxclients) maximum number of
- threads.
- traceflags int hex number 0x0 set tracing flags.
- Available if server is
- compiled with -DDEBUG.
- tempformat string see Table 2.
- dataformat string see Table 2.
- logformat string see Table 2.
- maxaddvolumes int small number >= 0 0 increases volume table size
- wrapcount int >=0 0 starting wrap count for log
- _______________________________________________________________________________________________________
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Table 3: Server Options
-
-
-
-
-
-
-
-
-
-
-
- 65
-
-
-
-
-
-
-
-
-
-
- ______________________________________________________________
- Flag Flag
- Name Effect
- ______________________________________________________________
- help print a message and exit
- skipdefault do not read default configuration files
- must be the first argument on the command line
- force do not confirm log regeneration option
- background put in background (for use with Bourne shell)
- ______________________________________________________________
- |
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
- Table 4: Server Command-Line Flags
-
- The "logvolume" option gives the path name of the volume that
- contains the log. A value must be given for the log volume.
-
- The "portname" option indicates a port number or the symbolic
- name of a port entry in /etc/services. The server connects to
- this port and listens for client requests on it. To enable
- clients to locate a server with a symbolic port name, the port
- name must to present in /etc/services on both the client and
- server machines. If no port name is given, a server looks for an
- entry "exodussm", registered for use with TCP, in /etc/services.
-
- By using port numbers instead of symbolic names avoids the need
- for entries in /etc/services. See the Unix manual page for ser-
- vices(5). An example entry for the default server name is:
-
- exodussm 1152/tcp # exodus storage manager
-
-
- The "errorfile" option directs server error messages and diagnos-
- tics to the given file. A value of "-" means that stderr is used.
-
- The "regenlog" option causes the log on the log volume to be
- regenerated. This overwrites all log records, so it should not be
- done unless the server was last shut down cleanly. Server
- automatically regenerate their logs when they are started with a
- newly formatted log volumes. When the option is set to "yes", a
- confirmation is requested. The confirmation can be disabled by
- starting the server with the "force" option.
-
- The "shutdown" option causes a server to shut down immediately
- after performing recovery or regenerating the log.
-
- The "checkpoints" option sets the checkpoint frequency for a
- server. The value represents the number of log pages written
- between checkpoints.
-
- The "progress" option causes a server to print messages tracing
- its progress. This is used for debugging; it slows the server.
-
-
-
- 66
-
-
-
-
-
-
-
-
-
-
- The "diskproc" option specifies the path name of the disk I/O
- program to be used by the server.
-
- The "intercache" option allows experiments to be run with and
- without inter-transaction caching of pages on the client.
-
- The "maxclients" option determines the number of clients a server
- can server at any one time. Servers create internal tables whose
- size depends on this value.
-
- The "maxthreads" value, determined by the "maxclients" value,
- should be sufficient, but can be overridden. If a server recov-
- ers from a failure without running out of threads, it has enough
- threads to handle client requests. If numerous distributed tran-
- sactions are active at the time of a server failure, it is possi-
- ble, but unlikely, that the server will not be able to recover
- with the default number of threads.
-
- The "traceflags" option is available only with a server that was
- compiled with debugging (the -DDEBUG flag). It is useful for
- programmers who are modifying the Storage Manager source code and
- testing their changes.
-
- The "dataformat", "logformat", and "tempformat" options are as
- described in Section 5.1.5, Formatting Volumes. Servers can
- mount and use volumes given in these options.
-
- The "maxaddvolumes" option indicates how large the mount table
- will be. The server reads its configuration files, counts the
- volumes named in the format options, and creates a mount table
- large enough to mount this many volumes and "maxaddvolumes" more.
- This is a strict limit to the number of volumes that the server
- can mount (at any one time) as long as it is running. The value
- of "maxaddvolumes" should not be boosted frivolously, because
- the size of the mount table affects the amount of shared memory
- required by the server. The default value is 0.
-
- The "wrapcount" option is rarely needed. The server will tell
- you if you ever need to set this option. It is needed if you add
- volumes after the server starts (maxaddvolumes > 0), and a volume
- that you are add was updated by a server running on a log that
- differs from the current log (or the log was regenerated since
- the added volume was last mounted.)
-
-
- 5.2.3. Normal Operation of Servers
-
- During normal operation, servers listen for connections and
- requests from clients and monitor terminal input. Error mes-
- sages are printed on the servers terminals when interesting
- events occur, for example, when a deadlock is detected, or a
- transaction is aborted by a server because of a problem such as
-
-
-
- 67
-
-
-
-
-
-
-
-
-
-
- insufficient log space.
-
- 5.2.3.1. Server Commands
-
- The following commands can be invoked from the standard input to
- the server: "help", "shutdown", "kill", "crash", "checkpoint",
- "printstats", "clearstats", "progress", "user", "addvolume",
- "rmvolume", "listvolumes", "listmount", "listdistr", "source",
- "redirect". When the server is compiled with profiling (-
- DPROFIL, -p), the server accepts the "profil" command. When the
- server is compiled with debugging (-DDEBUG), the server also
- accepts the "traceflags" and "tracelevel" commands.
-
- The "help" command provides a list of the commands.
-
- The "shutdown" command instructs the server to abort all active
- transactions and cleanly shut down. The "kill" command causes
- the server to halt immediately after displaying the status of
- mounted volumes. The "crash" command has the same effect as the
- "kill" command, except that a core dump is produced as well.
-
- The "checkpoint" command causes the server to take a checkpoint
- immediately. Checkpoints are taken periodically by servers. The
- default frequency is once every 100 log pages, but this can be
- changed by an application program (see
- sm_ChangeCheckpointFrequency( ) in Section 4.11.2, Administrative
- Operations).
-
- The "printstats" command prints general server statistics. The
- "clearstats" command clears any counters among the statistics.
-
- The "progress" command reverses the value of the "progress"
- option.
-
- The "user" command reverses the value of an internal flag that
- determines whether or not the server prints a message when a user
- (application) error is encountered. (There is no option to con-
- trol this.)
-
- The "addvolume" command adds a volume to the server's table of
- mountable volumes. The "addvolume" command takes a format-option
- name and a format-option value. For example, to add the data
- volume 8000, type
-
- addvolume dataformat /path/to/datafile:8000:1:1:300
-
- A volume cannot be added if the given format information con-
- flicts with other information in the table.
-
- The "rmvolume" command removes a volume from the server's table
- of mountable volumes. The command takes a volume identifier.
- For example, to remove the data volume 8000, type
-
-
-
- 68
-
-
-
-
-
-
-
-
-
-
- rmvolume 8000
-
- A volume cannot be removed if it is in use.
-
- The "listvolumes" command prints the server's table of mountable
- volumes.
-
- The "listmount" command prints a list of the volumes that are in
- some state of use: mounted, being mounted or being dismounted.
- It also prints the number of free "mount slots", which indicates
- how many more volumes could be mounted at any one time, given the
- server's configuration. To allow more volumes to be mounted at
- once, shut the server down, boost the value of the "maxaddvo-
- lumes" option, and restart the server.
-
- The "listdistr" command prints information about prepared distri-
- buted transactions. These transactions consume space in the log,
- and if they are not aborted or committed, eventually the server
- will fail because it will have run out of log space. See Section
- 4.3, Transactions, Section 4.11.1, External Two-Phase Commit
- Functions for information about distributed transactions.
-
- The "source" command takes one argument, the path name of a file
- from which to read commands. The server processes these com-
- mands, and when it reads the last command in the file, it resumes
- reading from the terminal. If the path name is missing or is
- /dev/tty, reading resumes from the terminal.
-
- The "redirect" command takes two arguments. The first argument
- indicates which output stream is to be redirected: messages to
- the terminal or error messages. The second argument is the path
- name of a file to which the output is written. When the output
- is redirected again, the stream is flushed to the given file and
- the file is closed. To redirect output to the terminal, use
- /dev/tty or omit the path name.
-
- The "profil" command causes the server to dump its profiling
- information to disk. This command is available only on a server
- that was compiled with profiling on (-DPROFIL -p). See the
- manual page for prof(1).
-
- The "traceflags" command may take an integer argument, which may
- be a hexadecimal number, such as "0xfa3", in which case it sets
- the server's trace flags word to that value. The command is
- available only with a server that was compiled with debugging on
- (-DDEBUG -g). The meanings of the trace flags are found in the
- server's source code, in src/include/global_trace.h. When "tra-
- ceflags" is used with no argument, it prints the value of the
- trace flags word.
-
- The "tracelevel" command is available with a server that was
- compiled with debugging on (-DDEBUG -g). When used with no
-
-
-
- 69
-
-
-
-
-
-
-
-
-
-
- argument, it prints the trace level for the trace flags that are
- on. When given an integer argument (1, 2, or 3), it sets the
- trace level for the trace flags that are on.
-
- 5.2.4. Shutting Down the Server
-
- The server can be shut down several ways. One method is to use
- one of the above-mentioned commands. Another is to run the
- "shutserver" program, described below, at the end of this sec-
- tion. A third way to shut down a server is to call
- sm_ShutdownServer( ) in a client program.
-
- A server may also shut itself down because of a fatal error, such
- as the unexpected death of a disk process or a bug. A fatal
- error causes the server to report the state of all the mounted
- volumes, dump core, and exit.
-
- The server allocates a Unix System V shared-memory segment and a
- semaphore set when it starts. If a server is shut down in a con-
- trolled fashion, it removes the segment and semaphore set. These
- resources are not removed when the server is terminated by kill
- -9 <server process> typed in the shell, by the "kill" or "crash"
- command given to the server's terminal monitor, or when the
- server process is killed by a debugger. If you use any one of
- these means to terminate a server, you must use ipcrm(1) to
- remove the resources. See the manual pages for ipcs(1) and
- ipcrm(1) for more information. If the segments and semaphore
- sets are not removed, eventually the operating system will run
- out of segments, and you will be unable to start a new server.
-
- If a server shuts down without having committed or aborted all
- its active transactions and flushed all its dirty pages to disk,
- recovery is required when the server is restarted. When a server
- shuts down, it prints the status of all the mounted volumes. It
- indicates if recovery is necessary on those volumes.
-
- 5.2.4.1. Running the Shutserver program
-
- The shutserver program is invoked:
-
- shutserver [-m machine] [-s servername] [-h].
-
- The "machine" specifies the name of the machine on which runs the
- server to be shut down. If "-m machine" is not given, the program
- uses the machine on which shutserver is executed. The "server-
- name" is the name of the server in /etc/services, If "-s server-
- name" is not given, "exodussm" is used. The "-h" option prints a
- brief help message.
-
-
-
-
-
-
-
- 70
-
-
-
-
-
-
-
-
-
-
- 5.2.5. Recovery
-
- When a server is started after a failure it automatically per-
- forms recovery. The time it takes for recovery depends on several
- factors, including the number of transactions in progress at the
- time of the failure, the number of log records generated by these
- transactions, and the number of log records generated since the
- last checkpoint.
-
- Recovery has three phases. After each phase, the server prints
- information about the time and I/O operations required to perform
- the phase.
-
- The first phase is analysis. The log is scanned to determine what
- transactions were active and which volumes were mounted at the
- time of the failure.
-
- After analysis, the volumes are mounted and the redo phase is
- performed. In the redo phase, data are restored to their state at
- the time of the failure.
-
- In the last phase, the undo phase, the server aborts the transac-
- tions that were active at the time of the crash. The volumes are
- dismounted, and a checkpoint is taken.
-
- For details of recovery in the Storage Manager, see [Fran92].
-
- 5.3. Tuning the Server
-
- There are several tuning parameters in the Storage Manager
- server. The following sections describe each one.
-
- 5.3.1.1. The Size of the Buffer Pool
-
- The size of a server's buffer pool is determined by the "buf-
- pages" option, which indicates the number of MIN_PAGESIZE pages
- in the buffer pool. If a server is the primary process on a
- machine, it should have a buffer pool close to the size of avail-
- able shared memory. When both an application and a server are
- running on the same machine, choosing a buffer pool size is more
- difficult. A "proper" choice depends on the behavior of the
- applications and their interactions with servers. A good rule of
- thumb is that that clients should have the adequate buffer space,
- to minimize client-server interaction.
-
- The buffer pool must fit in the available shared memory of the
- machine on which the server runs. The server will let you know
- if it cannot acquire enough shared memory when it starts. See
- the manual pages for ipcs(1) and ipcrm(1) to find out how much
- shared memory is in use. See your system administrator to find
- out how much shared memory has been configured for your systems
- if you find that you cannot run a server with a buffer pool of
-
-
-
- 71
-
-
-
-
-
-
-
-
-
-
- adequate size, and no shared memory segments are being wasted.
-
- 5.3.1.2. The Size of Log Pages
-
- The log page size is determined when a log volume is formatted.
- For a transaction mix dominated by transactions that generate
- more than a few kilobytes of log information, the larger the log
- page size, the better. For short running transactions, such as
- those found in transaction processing benchmarks, 8 Kbyte log
- pages give good results.
-
- 5.3.1.3. Checkpoint Frequency
-
- The checkpoint frequency is based on the number of log pages
- written. The default frequency is every 100 log pages. The fre-
- quency can be determined by setting the "checkpoint" configura-
- tion option. It can be changed in a running server by an appli-
- cation that calls sm_ChangeCheckpointFrequency( ). More frequent
- checkpoints tend to shorten the time required to recover after a
- server fails at the expense of processing time during normal
- operation. Checkpoints also cause the server's dirty pages to be
- flushed to disk, which may also improve performance during normal
- operation.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 72
-
-
-
-
-
-
-
-
-
-
- 6. REFERENCES
-
-
- [Care86] M. Carey, D. DeWitt, J. Richardson, and E. Shekita,
- Object and File Management in the EXODUS Extensible
- Database System, Proc. of the 1986 VLDB Conf., Kyoto,
- Japan, Aug. 1986.
-
- [Care89] M. Carey, D. DeWitt, E. Shekita, Storage Management for
- Objects in EXODUS, Object-Oriented Concepts, Databases,
- and Applications, W. Kim and F. Lochovsky, eds.,
- Addison-Wesley, 1989.
-
- [Chou85] H. Chou and D. Dewitt, An Evaluation of Buffer Manage-
- ment Strategies for Relational Database Systems, Proc.
- of the 1985 VLDB Conf., Stockholm, Sweden, Aug. 1985.
-
- [Fran92] M. Franklin, M. Zwilling, C.K.Tan, M. Carey, and D.
- DeWitt, Crash Recovery in Client-Server EXODUS, Proc.
- of the ACM SIGMOD Int'l. Conf. on Management of Data,
- San Diego, CA, June 1992.
-
- [Gray78] J. N. Gray, Notes on Database Operating Systems, Lec-
- ture Notes in Computer Science 60, Advanced course on
- Operating Systems, ed. G. Seegmuller, Springer Verlag,
- New York 1978.
-
- [Gray88] J. Gray, R. Lorie, G. Putzolu, I. Traiger, Granularity
- of Locks and Degrees of Consistency in a Shared Data
- Base, Readings in Database Systems, ed. M. Stonebraker,
- Morgan Kaufmann, San Mateo, Ca., 1988.
-
- [Litw88] W. Litwin, Linear Hashing: A New Tool for File and
- Table Addressing, Readings in Database Systems, ed. M.
- Stonebraker, Morgan Kaufmann, San Mateo, Ca., 1988.
-
- [Moha83] C. Mohan, B. Lindsay, Efficient Commit Protocols for
- the Tree of Processes Model of Distributed Transac-
- tions, Proc. 2nd ACM SIGACT/SIGOPS Symposium on Princi-
- ples of Distributed Computing, Montreal, Canada,
- August, 1983.
-
- [Moha89] C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, and P.
- Schwarz, ARIES: A Transaction Recovery Method Support-
- ing Fine-Granularity Locking and Partial Rollbacks
- Using Write-Ahead Logging, ACM Transactions on Database
- Systems, Vol. 17, No 1, March 1992.
-
- [Rich87] J. Richardson and M. Carey, Programming Constructs for
- Database System Implementation in EXODUS, Proc. of the
- ACM SIGMOD Int'l. Conf. on Management of Data, San
- Francisco, CA, May 1987.
-
-
-
- 73
-
-
-
-
-
-
-
-
-
-
- [exoArch] EXODUS Storage Manager Architecture Overview, unpub-
- lished, included in EXODUS Storage Manager software
- release.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 74
-
-
-
-
-
-
-
-
-
-
- A. APPENDIX : Locking Protocol for Storage Manager Operations
-
- The Storage Manager performs concurrency control using the stan-
- dard hierarchical two-phase locking protocol (see [Gray78],
- [Gray88]) for locking files and object pages. The lock hierarchy
- contains two granularities: file-level, and page-level. Locking
- for index operations is performed with a non-two-phase protocol,
- that allows multiple clients to read and update the same index.
- This section describes the lock modes used in the system, lists
- the locks requested for each Storage Manager file and object
- operation, and explains how deadlocks are handled. Lock acquisi-
- tion and release are implicit in all relevant operations, so
- clients cannot explicitly manage their own locks.
-
- A.1. Lock Modes
-
- Files are locked in one of six modes: no lock (NL), shared (S),
- exclusive (X), intent to share (IS), intent to exclusive (IX),
- share with intent to exclusive (SIX) [Gray78], [Gray88]. Only
- shared and exclusive locks are obtained on pages. Determining
- whether two locks are compatible (eg., when a client holds a lock
- on a file and another client wants to obtain a lock on it as
- well) can be done using a table. Table A.1 is a lock compatibil-
- ity table for the six file lock modes. Each row indicates a lock
- that some client can hold, and each column indicates a lock
- desired by another client. The Y and N table entries indicate
- (yes or no) whether the locks are compatible or not.
-
-
-
- ___________________________________
-
- Lock Lock Requested
- Held NL IS IX S SIX X
- ___________________________________
- NL Y Y Y Y Y Y
-
- IS Y Y Y Y Y N
-
- IX Y Y Y N N N
-
- S Y Y N Y N N
-
- SIX Y Y N N N N
-
- X Y N N N N N
- ___________________________________
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Table A.1: Lock Compatibility
-
-
-
-
-
- 75
-
-
-
-
-
-
-
-
-
-
- Another table can be used to express lock convertibility. A lock
- conversion occurs when a client holds a lock in some mode and
- requests an operation that requires a different mode for the
- lock. Table A.2 is a lock convertibility table for the six file
- lock modes. Each row indicates a lock that the client already
- holds and each column indicates the new lock mode requested. The
- entries represent the resulting lock mode obtained.
-
-
- A.2. Locks Obtained by Operations
-
- The locks mentioned above are obtained on two types of structures
- in the Storage Manager: files and pages. Only the pages that con-
- tain object headers and root entries are locked; large object
- data pages and file index pages are not locked. The entire root
- entry page is locked when a root entry is used.
-
- Table A.3 lists all of the locks obtained by the various Storage
- Manager operations. The column labelled "File Lock" indicates
- what lock mode is used for locking the file in question. The
- column labelled "Page Lock" indicates what lock mode is used for
- locking pages containing the objects or root entries in question.
- Locks are held until the end of the transaction in which they
- were acquired.
-
- Some applications may find it necessary to acquire more restric-
- tive locks on pages and files to avoid conflicts during lock-
- upgrade requests. For example, consider an application that reads
-
-
- ________________________________________
-
- Lock Lock Requested
- Held NL IS IX S SIX X
- ________________________________________
- NL NL IS IX S SIX X
-
- IS IS IS IX S SIX X
-
- IX IX IX IX SIX SIX X
-
- S S S SIX S SIX X
-
- SIX SIX SIX SIX SIX SIX X
-
- X X X X X X X
- ________________________________________
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Table A.2: Lock Convertibility
-
-
-
-
-
- 76
-
-
-
-
-
-
-
-
-
-
- an object (with sm_ReadObject( )) and subsequently writes it
- (with sm_WriteObject( )). When the object is read, a share lock
- is acquired for the object's page. When the object is written, a
- lock-upgrade request is sent to the server to obtain an exclusive
- lock on the page. This extra message is relatively expensive and
- can lead to potential deadlock if other clients are locking the
- page as well. To avoid this problem, the "pagelock" option can
- be used to change the default lock modes used when the client
- library locks a page. See Table 1 and the discussion of client
- options in Section 4.2, Initialization and Shutdown Operations
- for information about setting client options. See Appendix A for
- more information about lock modes and the Storage Manager's lock-
- ing protocols.
-
-
- A.3. Deadlock Detection and Avoidance
-
- With each lock request, a server analyzes its local waits-for
- graph and detects local cycles, or "local deadlocks". The
- request that would cause a deadlock is denied (returns
- esmFAILURE), and the client library returns esmLOCKCAUSEDDEADLOCK
- to the application in the global variable sm_errno.
-
- Distributed transactions may also cause a deadlock. The servers
- do not detect deadlocks that involve other servers. Global
- deadlocks are avoided by timing out locks. Each request that
- awaits a lock is aged. When its age exceeds the time given by
- the client's "locktimeout" option, the request is denied (returns
- esmFAILURE), and the client library returns esmLOCKBUSY to the
- application in the global variable sm_errno.
-
- When an application's request fails with esmLOCKBUSY or
- esmLOCKCAUSEDDEADLOCK, the application must abort its transac-
- tion, to free the locks it holds, and it must start its transac-
- tion again.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 77
-
-
-
-
-
-
-
-
-
-
-
-
- ___________________________________________________________________
-
- Operation File Lock Page Lock Comments
- ___________________________________________________________________
- sm_Initialize( ) - - no locks needed
- sm_ShutDown( ) - - no locks needed
- sm_OpenBufferGroup( ) - - no locks needed
- sm_CloseBufferGroup( ) - - no locks needed
-
- sm_SetRootEntry( ) - X root entry page
- sm_GetRootEntry( ) - S root entry page
- sm_RemoveRootEntry( ) - X root entry page
-
- sm_CreateFile( ) X -
- sm_DestroyFile( ) X -
-
- sm_GetFirstOid( ) S -
- sm_GetLastOid( ) S -
- sm_GetNextOid( ) S -
- sm_GetPreviousOid( ) S -
-
- sm_OpenScan( ) S -
- sm_OpenScanWithGroup( ) S -
- sm_ScanNextObject( ) - - no locks needed
- sm_CloseScan( ) - - no locks needed
-
- sm_OpenLoad( ) X -
- sm_LoadNextObject( ) - - no locks needed
- sm_CloseLoad( ) - - no locks needed
-
- sm_CreateObject( ) IX X unordered file
- sm_DestroyObject( ) IX X
- sm_ReadObject( ) IS S
- sm_ReadObjectHeader( ) IS S
- sm_ReleaseObject( ) - - no locks needed
- sm_WriteObject( ) IX X
- sm_InsertInObject( ) IX X
- sm_AppendToObject( ) IX X
- sm_DeleteFromObject( ) IX X
-
- sm_CreateVersion( ) IX X
- sm_FreezeVersion( ) IX X
- ___________________________________________________________________
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Table A.3: Locks Obtained by Operations
-
-
-
-
-
-
-
- 78
-
-
-
-
-
-
-
-
-
-
- B. APPENDIX : Generation of Unique Numbers for OIDs
-
- The "unique" field of an OID is special 32-bit value that is gen-
- erated when the object is created and used to detect instances
- where the OID has become dangling or corrupted. The values that
- are stored in "unique" fields are generated by Storage Manager
- servers. Disk volumes are partitioned into blocks of 32 pages,
- and for each partition a 32-bit counter is maintained. When a new
- page is allocated, it is allotted a range (100) of unique numbers
- to use during object creation. The counter in the partition con-
- taining the new page is incremented to reflect the allotment.
- When this allotment has been exhausted, a request is made to the
- server for another allotment. When an object is created in a par-
- ticular partition, the "unique" field of the new object's OID is
- set to the next available number in the range on the page. While
- this strategy does not guarantee that OIDs are unique for all
- time, the probability of a dangling OID that maps to the same
- page and the same slot, and has the same "unique" field as a
- valid OID is very low. As a result, "unique" fields can be used
- virtually to guarantee the validity of an OID. We adopted this
- approach instead of using unique-for-all-time logical OIDs with a
- surrogate index in order to avoid the extra disk I/Os that might
- be needed to translate a logical OID to a physical address.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 79
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- 80
-
-
-
-
-
-
- TABLE OF CONTENTS
-
-
- 1 INTRODUCTION ............................................ 1
- 2 OVERVIEW OF THE EXODUS STORAGE MANAGER .................. 1
- 2.1 Architecture .................................. 1
- 2.2 Facilities .................................... 2
- 2.2.1 Objects ............................. 2
- 2.2.2 Versions ............................ 3
- 2.2.3 Files ............................... 3
- 2.2.4 Indexes ............................. 3
- 2.2.5 Volumes ............................. 3
- 2.2.6 Transactions ........................ 4
- 2.2.7 Concurrency Control ................. 4
- 2.2.8 Recovery ............................ 4
- 2.2.9 Configuration Options ............... 4
- 2.3 Illustration of Using the Storage Manager ..... 4
- 2.3.1 Files Needed ........................ 5
- 2.3.2 Preparing Your Disks ................ 6
- 2.3.3 Configuring a Server ................ 7
- 2.3.4 Compiling and Linking Your
- Application ........................................... 8
- 2.3.5 Configuring and Running Your
- Application ........................................... 8
- 2.3.6 Shutting Down the Server ............ 9
- 3 CONFIGURATION OPTIONS AND CONFIGURATION FILES ........... 11
- 4 THE STORAGE MANAGER APPLICATION INTERFACE ............... 13
- 4.1 Handling Errors ............................... 13
- 4.2 Initialization and Shutdown Operations ........ 15
- 4.3 Transactions .................................. 21
- 4.3.1 Transaction Identifiers ............. 21
- 4.3.2 Transaction States .................. 22
- 4.3.3 Transaction Operations .............. 22
- 4.4 Mounting and Dismounting Volumes .............. 24
- 4.5 Root Entries .................................. 25
- 4.6 Buffer Operations ............................. 26
- 4.7 Operations on Objects ......................... 28
- 4.7.1 Creating and Destroying Objects ..... 31
- 4.7.2 Pinning and Unpinning Objects ....... 32
- 4.7.3 Modifying Objects ................... 33
- 4.7.4 Object Headers ...................... 35
- 4.8 Versions of Objects ........................... 36
- 4.9 Operations on Files ........................... 37
- 4.9.1 Consistency Guarantees for Files
- ....................................................... 38
- 4.9.2 Scanning Files ...................... 40
- 4.9.3 Bulk-loading Files .................. 43
- 4.10 Operations on Indexes ........................ 43
- 4.10.1 Creating and Destroying Indexes
- ....................................................... 44
- 4.10.2 Inserting and Removing Index
- Elements .............................................. 46
- 4.10.3 Loading Indexes in Bulk ............ 46
- 4.10.4 Scanning Indexes ................... 47
- 4.11 Advanced Topics .............................. 49
- 4.11.1 External Two-Phase Commit
-
-
-
- i
-
-
-
-
-
-
-
-
-
-
- Functions ............................................. 49
- 4.11.2 Administrative Operations .......... 53
- 4.11.3 Tuning the Application ............. 57
- 5 USING STORAGE MANAGER SERVERS ........................... 58
- 5.1 Managing Volumes .............................. 58
- 5.1.1 Log Volumes ......................... 58
- 5.1.2 Data Volumes ........................ 58
- 5.1.3 Temporary Volumes ................... 58
- 5.1.4 Raw Partitions and Unix Files ....... 59
- 5.1.5 Formatting Volumes .................. 59
- 5.1.6 Size Requirements for Log Volumes
- ....................................................... 61
- 5.1.7 Backing Up Volumes .................. 62
- 5.2 Using the Server .............................. 63
- 5.2.1 Starting the Server ................. 63
- 5.2.2 Configuring the Server .............. 64
- 5.2.3 Normal Operation of Servers ......... 67
- 5.2.3.1 Server Commands ........... 68
- 5.2.4 Shutting Down the Server ............ 70
- 5.2.4.1 Running the Shutserver
- program ............................................... 70
- 5.2.5 Recovery ............................ 71
- 5.3 Tuning the Server ............................. 71
- 5.3.1.1 The Size of the Buffer
- Pool .................................................. 71
- 5.3.1.2 The Size of Log Pages ..... 72
- 5.3.1.3 Checkpoint Frequency ...... 72
- 6 REFERENCES .............................................. 73
- A APPENDIX : Locking Protocol for Storage Manager
- Operations ............................................ 75
- A.1 Lock Modes .................................... 75
- A.2 Locks Obtained by Operations .................. 76
- A.3 Deadlock Detection and Avoidance .............. 77
- B APPENDIX : Generation of Unique Numbers for OIDs ........ 79
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- ii
-
-
-
-